Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
11:00 → 11:20  Discussion (20m)
Speakers: David Rohr (CERN), Ole Schmidt (CERN)
Color code: critical or news during the meeting: green; news from this week: blue; news from last week: purple; no news: black.
High priority framework topics:
- Problem at end of run producing lots of error messages and breaking many calibration runs: https://alice.its.cern.ch/jira/browse/O2-3315
- Unfortunately, errors still appear online after fixing those seen in the FST (e.g. partition 2e5XVwXYwvX).
- Fix START-STOP-START for good
- https://github.com/AliceO2Group/AliceO2/pull/9895 still not merged due to a conflict.
- Async workflow for 1NUMA domain with higher multiplicities gets stuck: https://alice.its.cern.ch/jira/browse/O2-3399
- Problem with workflow sometimes slow due to "oscillations" in the processing understood:
- Publishing side of the oscillations fixed by https://github.com/AliceO2Group/AliceO2/pull/10969, but oscillations remain due to slow processing (see below).
- Multi-threaded pipeline still not working in FST / sync processing, but only in standalone benchmark.
- Support for expendable flag merged in DPL today: https://alice.its.cern.ch/jira/browse/O2-3398
- Problem when QC tries to export the expendable flags: https://alice.its.cern.ch/jira/browse/QC-953 - to be checked between QC and DPL experts.
Other framework tickets:
- TOF problem with receiving condition in tof-compressor: https://alice.its.cern.ch/jira/browse/O2-3681
- After the 64k vertex problem was fixed, there was a problem in the DebugGUI getting stuck at 100% CPU for large workflows https://alice.its.cern.ch/jira/browse/O2-3535 - fixed
- Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
- Backpressure reporting when there is only 1 input channel: no progress.
- Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
- https://alice.its.cern.ch/jira/browse/O2-1900 : FIX in PR, but has side effects which must also be fixed.
- https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
- https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
- https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
- https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
- https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future. This problem popped up again when corrupt TPC data was killing runs due to a missing check. I'll try to have a look and add a sanity check and a protection, since Matthias is no longer available.
- Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
- Support in DPL GUI to send individual START and STOP commands.
- The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real. It will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.
Global calibration topics:
- TPC calib problem: Need to repeat the test with the latest fixes, but since we still have problems in all online runs, I think it will require more work.
Async reconstruction
- Async reco with 1NUMA domain setup:
- Memory leak in async reco was not a real leak, but metrics queuing up since metric processing was too slow. Sped up by Giulio meanwhile, and also disabled by Chiara.
- Can now run with up to 500 CTF files; 1000 CTF files failed since the output exceeded the maximum allowed size. The limit is to be raised and the test retried.
- Checked the problem with severe serial overhead before / after processing: the actual 1 NUMA domain processing was only 50% of the execution time of the task:
- Slow metric processing fixed. (Giulio)
- Metric postprocessing by slow jq commands removed. (Chiara)
- AOD merging parallelized (Chiara)
- Should scale better with more CTF files, since there will then be more AODs to merge in parallel. Tried locally with 250 CTF files --> 5 AODs; the expected speedup of 5 was seen.
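As an illustration of why the speedup tracks the number of output AODs, here is a minimal sketch of running independent merge groups in parallel. This is assumed structure only: merge_one() and the file names are hypothetical stand-ins, not the actual O2/Grid merging tools.

```python
# Hedged sketch: parallel merging of independent output groups. merge_one()
# and the file names are hypothetical stand-ins for the real AOD merge jobs.
from concurrent.futures import ThreadPoolExecutor

def merge_one(group):
    # Stand-in for one AOD merge job over its list of input files.
    return "AOD[" + ",".join(group) + "]"

def merge_all(groups):
    # Each group is independent, so all merges run concurrently and wall time
    # approaches that of the slowest single merge: speedup ~ number of groups.
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        return list(pool.map(merge_one, groups))
```

With 250 CTF files split into 5 groups, merge_all runs 5 merges concurrently, which matches the factor-5 speedup seen in the local test.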
- Investigation of remaining oscillations:
- GPU reconstruction sometimes slow. Not fully clear what happens, but GPUs sometimes seem stuck / slow for a while when the server is under high load. E.g. DMA transfer rates drop from 30 GB/s to 30 MB/s, and a single transfer can then last 20 s. Not clear what triggers this, but no manpower to look into it at the moment.
- Other processes also sometimes need excessive time:
- ITS tracking. - fixed by Matteo
- Secondary vertexing (up to 200s), seems to be a side effect of excessive ITS seeds. - Fixed by Matteo
- EMCAL QC. - Still waiting for fix
- The problem is that such slow processes cause TFs to queue up at their input, so fewer TFs are in flight, leading to oscillations in the processing rate.
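The queueing effect can be illustrated with a toy model (this is NOT the real DPL scheduler, and all parameters are invented): a bounded input queue feeds one stage that is occasionally very slow, and the delivered rate alternates between full speed and zero.

```python
# Toy model of the oscillation (NOT the real DPL scheduler; all parameters
# are hypothetical): a bounded input queue feeds one stage that is very slow
# for every slow_every-th TF. While the slow TF is processed, completions
# drop to zero and TFs pile up in the queue, so the output rate oscillates.

def simulate(n_tf=100, queue_cap=4, slow_every=25, slow_cost=10):
    pending, queue, busy, processed = n_tf, 0, 0, 0
    rates = []  # completions per tick: oscillates between 1 and 0
    while processed < n_tf:
        if pending and queue < queue_cap:   # upstream fills the bounded queue
            pending -= 1
            queue += 1
        if busy == 0 and queue:             # stage picks up the next TF
            queue -= 1
            busy = slow_cost if processed % slow_every == 0 else 1
        done = 0
        if busy:
            busy -= 1
            if busy == 0:
                done = 1
                processed += 1
        rates.append(done)
    return rates
```

Every TF completes eventually (sum(rates) == n_tf), but the per-tick rate contains long stretches of zeros whenever the slow stage holds a TF, which is the oscillation pattern described above.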
- Async reco performance benchmarks:
- Conducted a comparison of different settings on the EPN, and measured full throughput on an EPN:
- 8-core CPU workflow: 4.81s per TF (cannot be used, since it runs OOM very quickly)
- 16-core CPU workflow: 4.27s per TF (stable)
- 1GPU workflow: 1.83s (unstable, can run OOM, will get better with shorter TFs)
- 1NUMA workflow: 2.2s (stable)
- Not clear why the 1NUMA workflow is slower than the 1GPU one. A comparison of the processing times shows that the same TFs need more CPU time in 1NUMA than in 1GPU, and the workflow is CPU limited. Not yet clear how this can be.
EPN major topics:
- EPN OS discussion
- EPNs bumped to Alma 8.7 / ROCm 5.4.2, still no reply from AMD on official support. All Jenkins builders / CI containers / sync / async software versions bumped to the new setup.
- Fast movement of nodes between async / online without EPN expert intervention.
- Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
- Check total bytes written on the SSDs, to get an impression how much of the lifetime of the SSDs we have already used, switch to using a ramdisk if necessary: https://alice.its.cern.ch/jira/browse/EPN-198 https://alice.its.cern.ch/jira/browse/EPN-237
- Max 11% lifetime used, can delay deployment of RAMDISK cache to later this year.
- Need to double-check what is expected total bytes written over the next year, since GRID group would like to increase CTF file size, which could prevent the RAM disk cache.
- Improve DataDistribution file replay performance, currently cannot do faster than 0.8 Hz, cannot test MI100 EPN in Pb-Pb at nominal rate, and cannot test pp workflow for 100 EPNs in FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
- Need DDS/ODC feature to deploy different topologies on EPNs with MI50 and with MI100. ETA End of March + ~2 weeks - Status?
- Go to error state if a critical task (e.g. a calib task) fails (taking nmin into account). But currently we do not see failing calibration tasks at all, except for a message in the InfoLogger. ODC should go to error, and ECS should then stop the run automatically, also when n < nmin. ETA End of March - Status?
Other EPN topics:
- Deploy DDS with support for non-critical tasks (eg QC): https://alice.its.cern.ch/jira/browse/EPN-131
- Check NUMA balancing after SHM allocation, sometimes nodes are unbalanced and slow: https://alice.its.cern.ch/jira/browse/EPN-245
- Fix problem with SetProperties string > 1024/1536 bytes: https://alice.its.cern.ch/jira/browse/EPN-134 and https://github.com/FairRootGroup/DDS/issues/440
- After software installation, check whether it succeeded on all online nodes (https://alice.its.cern.ch/jira/browse/EPN-155) and consolidate software deployment scripts in general.
- Improve InfoLogger messages when environment creation fails due to too few EPNs / calib nodes available, ideally report a proper error directly in the ECS GUI: https://alice.its.cern.ch/jira/browse/EPN-65
- If the DD connection of a node fails, the node should be taken out and counted against nmin; otherwise it can give the false impression that the processing on the other nodes is too slow.
EPN farm upgrade:
- Conducted a test (in the FST crash case) to:
- Fully synchronize the GPU
- Run a kernel in stream 0
- Run a DMA transfer in stream 0
- Fully synchronize the GPU
- Copy the DMA output to a second buffer
- Repeat the DMA transfer
- Fully synchronize the GPU
- Compare the buffers
- This comparison sometimes shows differences, i.e. the first DMA transfer is not retrieving the correct data from the GPU.
- Checked manually by adding long sleeps, that the problem is not the synchronization failing to wait for the DMA transfer.
- So it really seems there is a problem with device-side synchronization, as we assumed initially. Must be checked by AMD.
- Will repeat this test also with the second test case in async reco.
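The comparison logic of the test above can be mocked host-side. This Python sketch is NOT the actual HIP/CUDA test code; flaky_dma() is a hypothetical stand-in used only to inject a wrong first transfer and show how the buffer comparison detects it.

```python
# Host-side mock of the double-transfer check (NOT the real GPU code):
# read the buffer back twice with full synchronization in between, keep a
# copy of the first read, and compare. A difference between the two reads
# means the first DMA transfer did not retrieve the correct data.
# flaky_dma() is hypothetical, used only to inject the failure mode.

def flaky_dma(device_buf, fail=False):
    # Simulated DMA read-back; fail=True models a corrupted first transfer.
    return [0] * len(device_buf) if fail else list(device_buf)

def first_transfer_corrupt(device_buf, first_fails=False):
    first = flaky_dma(device_buf, fail=first_fails)  # DMA transfer #1
    saved = list(first)                              # copy to a second buffer
    second = flaky_dma(device_buf)                   # repeat the DMA transfer
    return saved != second                           # mismatch => bad 1st read
```

In the real test the two reads are separated by full GPU synchronization, so a mismatch cannot be explained by the synchronization failing to wait for the transfer; it points at device-side synchronization, as stated above.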
Full system test issues:
- Long-running full system test (> 6 hours) seems to die due to running out of SHM; could be an SHM leak again, need to check.
Topology generation:
- Ole is investigating the use of set -u or set -e to catch more errors, but they have some drawbacks. Current plan is to use -u, but not -e. To be merged when Ole is back from vacation in 2 weeks. - Status?
Software deployment at P2:
- O2PDPSuite updated on Tuesday; the update went mostly smoothly.
Switch to 32 orbit TF:
- TOF switched their code to read the TF length from the GRP (to be checked); with this, all detectors have adapted.
QC / Monitoring / InfoLogger updates:
- TPC has opened first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and plan is to extend this to all detectors, and to include also trending for raw data sizes.
- New logFetcher tool from Ole to gather log files for a run, without logging in to individual EPNs.
CCDB topics:
- Costin will follow up on some proposed improvements to have a proper fix for the future. See https://alice.its.cern.ch/jira/browse/O2-3097?filter=-2
- https://github.com/AliceO2Group/AliceO2/pull/9992#event-7875018488
AliECS related topics:
- ECS should replace newlines in extra env field by spaces, but doesn't. Filed a bug report in JIRA.
GPU ROCm / compiler topics:
- Changed build system to allow also binaries for different CUDA architectures in the same build.
- Regression in ROCm 5.4.3 still not understood, reported to AMD, but current MI100 problem has priority.
- Found a new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
- Found a new miscompilation with -ffast-math enabled in looper following; disabled -ffast-math for now.
- Must create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
- Ruben found another compiler problem with template treatment. We have a workaround for now; need to create a minimal reproducer and file a bug report.
- Debugging the calibration, debug output triggered another internal compiler error in HIP compiler. No problem for now since it happened only with temporary debug code. But should still report it to AMD to fix it.
TPC GPU Processing
- Random GPU crashes under investigation.
- Problem with GPU tracking with distortions was due to failures of the inverse transform map, since it is unconstrained. Will continue the investigation when the map is fixed.
ANS Encoding
- Michael still working on it, STATUS?
Issues currently lacking manpower, waiting for a volunteer:
- For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
- Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
11:20 → 11:25  TRD Tracking (5m)
Speaker: Ole Schmidt (CERN)
11:25 → 11:30  ITS Tracking (5m)
Speaker: Matteo Concas (CERN)
- Currently: finalising Road Finding
- PRs pending:
- none
- CHEP plans:
- Road finding can be a realistic and reasonable point to arrive at.
- Understanding which plots are reasonably approvable by that time.
- Overall, I see no timing performance regression wrt CPU (optimisation not yet discussed), nor exaggerated speedups; there are features which can still be valuable (automatic code generation for HIP, intra-TF flexibility, memory size configurability).
- TL;DR: ITS GPU Vertexer (TimeFrame "splitting")
Step               | on GPU (CUDA / HIP) | CPU consistency | Speedup | Optimised (CUDA / HIP) | Comments/Improvements
Tracklet Finder    | ✅ / ✅             | ~               | 🚧      | ✅ / ✅                |
Tracklet Selection | ✅ / ✅             | ~               | 🚧      | ✅ / ✅                |
Vertex Fitter      | ❌                  | /               | /       | ❌ / ❌                | Investigate after tracking
- TL;DR: ITS GPU Tracker (TimeFrame "splitting")
Step                  | on GPU (CUDA / HIP) | CPU consistency | Speedup | Optimised (CUDA / HIP) | Comments/Improvements
Tracklet Finder       | ✅ / ✅ (δ~1‰)      | ~               | 🚧      | ✅ / ✅                | multi-ROF sliding window
Trkl duplicate finder | ✅ / ✅             | ~               | ✅      | ✅ / ✅                | thrust & CUB mainly
Cell finder           | ✅ / 🚧             | ~               | 🚧      | ✅ / ✅                |
Cell neighbour finder | ✅ / 🚧             | ~               | 🚧      | ✅ / ✅                |
Road finder           | 🚧 / 🚧             | ~               | 🚧      | 🚧 / 🚧                |
Track fitting         | ❌ / ❌             | ❌              | ❌      | ❌                     |
11:30 → 11:35  TPC ML Clustering (5m)
Speaker: Christian Sonnabend (Heidelberg University (DE))
Updates on the 2D clusterization
Clusterization
- Slight issues with extracting the correct time information for found cluster maxima in GPUChainTrackingClusterizer.cxx: maxima are written out to file, but the time information is incorrect, since it always saves the current local time (typically wrong by about one TPC drift time, ~600 time units). To be investigated.
- Built a local "dummy-maximizer" that finds maxima directly from the digits by comparing each digit with its surroundings and checking that its charge is ≥ the charge of each of its 8 neighbours.
- ~73% of all digit maxima (dummy-clusterizer) can be found in the ideal clusterizer -> the neural network should learn when there is a maximum and when there is not, using the ideal clusterizer as ground truth.
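The dummy-maximizer rule can be sketched as follows (illustrative only, NOT the O2 implementation; a plain 2D charge array stands in for the real digit container):

```python
# Sketch of the dummy-maximizer (NOT the O2 code): a digit is a maximum
# candidate if its charge is >= the charge of each of its 8 neighbours.

def find_maxima(charges):
    """charges: 2D list indexed as [pad][time]; returns (pad, time) tuples."""
    npad, ntime = len(charges), len(charges[0])
    maxima = []
    for p in range(1, npad - 1):
        for t in range(1, ntime - 1):
            c = charges[p][t]
            if all(c >= charges[p + dp][t + dt]
                   for dp in (-1, 0, 1) for dt in (-1, 0, 1)
                   if (dp, dt) != (0, 0)):
                maxima.append((p, t))
    return maxima
```

Using >= (rather than >) keeps plateau digits, matching the "charge ≥ neighbouring charges" rule above; edge pads/time bins are skipped for simplicity in this sketch.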
Neural Networks
- PbPb, 50 Events @ 50kHz
- CNN trained on 11x11 windows (pad x time) with an output of 0 (no maximum in ideal clusterizer) to 1 (maximum found in ideal clusterizer).
- Using a hard cut at 50% probability by the network: Currently ~84% accuracy is reached (accuracy = (TP + TN) / all)
- As a concrete number: for 730k maxima found in the digits with a corresponding maximum in the ideal clusterizer, the network produces 740k cluster-maxima (so the output maxima are at least not many more than the found ideal clusters).
--> Further steps:
- Clean-up of the training data is needed
- Ideal clusterizer is sometimes one digit off from a maximum in the digits (maybe noise?) - to be investigated
--> Benefits:
- Choose the cutoff probability: a tradeoff between computing resources and accuracy (setting the threshold higher will produce fewer maxima but also reduce the accuracy for already correctly identified maxima).
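The cutoff tradeoff can be made concrete with synthetic numbers (the scores and labels below are invented, not the actual network output):

```python
# Illustration of the cutoff tradeoff with synthetic data (NOT the real
# network output): accuracy = (TP + TN) / all, as defined above.

def evaluate(scores, labels, cutoff):
    preds = [s >= cutoff for s in scores]
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    tn = sum(1 for p, l in zip(preds, labels) if not p and not l)
    return sum(preds), (tp + tn) / len(labels)  # (#predicted maxima, accuracy)

scores = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9, 0.55, 0.3]  # invented probabilities
labels = [False, False, False, True, True, True, False, True]
n_low, acc_low = evaluate(scores, labels, 0.5)    # moderate cutoff
n_high, acc_high = evaluate(scores, labels, 0.8)  # higher cutoff: fewer maxima
```

On this toy sample the higher cutoff predicts fewer maxima but also loses true ones, lowering the accuracy for already correctly identified maxima, which is exactly the tradeoff described above.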