Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Videoconference
ALICE GPU Meeting
Zoom Meeting ID
61230224927
Host
David Rohr
    • 11:00 → 11:20
      Discussion 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

       

      High priority framework topics:

      Other framework tickets:

      • After the 64k vertex problem was fixed, there is now a problem in the DebugGUI: it gets stuck at 100% CPU for large workflows. https://alice.its.cern.ch/jira/browse/O2-3535
      • Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead, to get the pure payload rate; low priority.
      • Backpressure reporting when there is only 1 input channel: no progress.
      • Stop the entire workflow if one process segfaults / exits unexpectedly. Tested again in January; still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
      • https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts; Matthias will add a check to prevent such failures in the future.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it happens only at termination, and fixing the topology avoids it in any case, but we should still understand and fix the crash itself. A reproducer is available.
      • Support in DPL GUI to send individual START and STOP commands.
      • The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real. It will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.

      Global calibration topics:

      • TPC calib problem: need to repeat the test with the latest fixes, but as we still have problems in all online runs, I think it will require more work.

      Async reconstruction

      • Async reco with 1-NUMA-domain setup:
        • Test with the latest updates to be started by Costin today.
        • Current TF throughput of the processing only (neglecting all overhead):
          • ~13.5 s per TF in the 1-GPU workflow running on a full NUMA domain
          • ~6.5 s in the latest test of the 1-NUMA workflow (250 CTF files)
          • ~5.25 s in my latest local test
        • The memory leak in async reco was not a real leak, but metrics queuing up because metric processing was too slow. Meanwhile sped up by Giulio, and also disabled by Chiara.
      • Can now run with up to 500 CTF files; 1000 CTF files failed since the output exceeded the maximum allowed size. The limit is to be raised and the test retried.
      • Problem with CCDB pulling in libUV, breaking ROOT since the headers were not available; broke async reco with the latest O2 tag. Fixed by Giulio: https://github.com/AliceO2Group/AliceO2/pull/10954.
      • Checked the problem with severe serial overhead before / after processing: the actual 1-NUMA-domain processing was only 50% of the task's execution time:
        • Slow metric processing fixed. (Giulio)
        • Metric postprocessing by slow jq commands removed. (Chiara)
        • AOD merging parallelized (Chiara)
          • Should scale better with more CTF files, since then there will be more AODs to be merged in parallel. Tried locally with 250 CTF files --> 5 AODs; the expected speedup of 5 was seen.
      • Investigation of remaining oscillations:
        • GPU reconstruction is sometimes slow. Not fully clear what happens, but GPUs sometimes seem stuck / slow for a while when the server is under high load. E.g. DMA transfers drop from 30 GB/s to 30 MB/s, and a single transfer can then last 20 s.
          • TPC tracking was seen to last up to 250 s in such cases, compared to the normal 3-4 s. GPUs return to the normal state with the next TF.
          • Not data-driven / not reproducible when rerunning the same TF.
          • Should be investigated and perhaps reported to AMD.
        • Other processes also sometimes need excessive time:
          • ITS tracking.
          • Secondary vertexing (up to 200 s); seems to be a side effect of excessive ITS seeds.
          • EMCAL QC.
        • The problem is that such slow processes cause TFs to queue up at their input, so fewer TFs are in flight, leading to oscillations in the processing rate (a toy illustration follows below).
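
      As a toy illustration of this queueing effect (a made-up two-stage chain with invented numbers, not our actual topology or the DPL scheduler): once the slow stage stalls, the in-flight cap is consumed by TFs parked in its input queue, the upstream stage runs dry, and the completion rate dips and then recovers.

        // Toy model of rate oscillation caused by one occasionally-slow stage.
        // All numbers are invented for illustration; this is not the real DPL.
        #include <cstdio>
        #include <queue>

        int main()
        {
          const int maxInFlight = 8;       // in-flight TF cap
          const double tA = 2.0, tB = 4.0; // nominal per-TF stage times [s]
          double aBusyUntil = 0, bBusyUntil = 0;
          int inFlight = 0, donePerMinute = 0;
          bool bProcessing = false, stalled = false;
          std::queue<int> bQueue;          // TFs waiting at stage B's input
          for (double now = 0; now < 600; now += 1.0) {
            if (bProcessing && now >= bBusyUntil) { // B finished a TF
              bProcessing = false;
              --inFlight;
              ++donePerMinute;
            }
            if (now >= aBusyUntil && inFlight < maxInFlight) { // A injects a TF
              ++inFlight;
              aBusyUntil = now + tA;
              bQueue.push(1);
            }
            if (!bProcessing && !bQueue.empty()) { // B starts the next TF
              bQueue.pop();
              bProcessing = true;
              double t = tB;
              if (!stalled && now >= 120) { t = 200; stalled = true; } // one hiccup
              bBusyUntil = now + t;
            }
            if ((long)now % 60 == 59) { // report completed TFs per minute
              std::printf("minute %ld: %2d TFs done\n", (long)now / 60, donePerMinute);
              donePerMinute = 0;
            }
          }
          return 0;
        }

      The output shows the steady ~15 TFs/min rate collapsing to zero for several minutes while the stalled stage holds the whole in-flight budget, mirroring the observed rate oscillations.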

      EPN major topics:

      • EPN OS discussion
        • EPNs bumped to Alma 8.7 / ROCm 5.4.2; still no reply from AMD regarding official support. All Jenkins builders / CI containers / sync / async software versions bumped to the new setup.
      • Fast movement of nodes between async / online without EPN expert intervention.
      • Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
      • Check the total bytes written to the SSDs to get an impression of how much of the SSDs' lifetime we have already used; switch to using a ramdisk if necessary: https://alice.its.cern.ch/jira/browse/EPN-198 https://alice.its.cern.ch/jira/browse/EPN-237
        • Max 11% of the lifetime used; deployment of the RAMDISK cache can be delayed to later this year.
      • Improve DataDistribution file replay performance: currently we cannot replay faster than 0.8 Hz, so we cannot test the MI100 EPN in Pb-Pb at nominal rate, and cannot test the pp workflow for 100 EPNs in the FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
      • Need a DDS/ODC feature to deploy different topologies on EPNs with MI50 and with MI100. ETA: end of March + ~2 weeks.
      • Go to error state if a critical task (e.g. a calib task) fails (taking nmin into account). Currently we do not see failing calibration tasks at all, except for a message in the InfoLogger. ODC should go to error, and ECS should then stop the run automatically, also when n < nmin. ETA: end of March.

      Other EPN topics:

      EPN farm upgrade:

      • Identified the crashes on MI100 to be 2 independent problems:
        • Crash in the TPC track model encoding.
        • Corrupt data output to the host, which makes ITS-TPC matching crash due to out-of-bounds cluster indices.
      • In both cases the problem seems to be incorrect synchronization between kernel calls and DMA transfers. The synchronization is requested correctly by AliceO2 but not respected by the GPU / the kernel driver / the ROCm runtime. AMD needs to investigate. A schematic of the pattern in question is sketched below.
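
      For reference, the ordering at issue is schematically the following (a simplified HIP sketch with made-up kernel and buffer names, not the actual AliceO2 GPU code):

        // Schematic of the kernel / DMA ordering in question (simplified, not
        // the real AliceO2 code): the device-to-host copy must not start before
        // the kernel has finished writing its output. Same-stream ordering plus
        // hipStreamSynchronize() guarantees this by the API contract; the MI100
        // failures look as if the copy nevertheless observes incomplete data.
        #include <hip/hip_runtime.h>
        #include <vector>

        __global__ void produceOutput(int* out, int n) // stand-in for a reco kernel
        {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = i;
        }

        int main()
        {
          const int n = 1 << 20;
          int* dOut = nullptr;
          std::vector<int> hOut(n);
          hipStream_t stream;
          hipStreamCreate(&stream);
          hipMalloc((void**)&dOut, n * sizeof(int));

          hipLaunchKernelGGL(produceOutput, dim3((n + 255) / 256), dim3(256), 0, stream, dOut, n);
          hipStreamSynchronize(stream); // the kernel must be complete here
          // DMA transfer of the kernel output; must see the finished data.
          hipMemcpyAsync(hOut.data(), dOut, n * sizeof(int), hipMemcpyDeviceToHost, stream);
          hipStreamSynchronize(stream); // the host may read hOut only after this

          hipFree(dOut);
          hipStreamDestroy(stream);
          return 0;
        }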

      Full system test issues:

      • Full system test is back to stable; had to revert one more MCH PR that broke it (for unclear reasons, Laurent is investigating).
      • FST with pp data crashes on the EPNs in the digitization phase due to an alignment problem of a DCS CCDB object. Could be a ROOT bug. Ruben is investigating.

      Topology generation:

      • Ole is investigating using set -u or set -e to catch more errors, but they have some drawbacks. The current plan is to use -u but not -e. To be merged when Ole is back from vacation in 2 weeks.

Software deployment at P2:

      • O2PDPSuite updated on Monday; some failures necessitated another update.
        • A check for duplicate outputs in the topology was added, but MCH reco had duplicate outputs.
        • Severe performance regression in the CTF code due to an accidentally committed -O0 compile flag.
        • FST crashes with O2/dev on .tf files but works on .raw files, since runNumber 0 of raw .tf MC files was not supported. Fixed.
           

      Switch to 32 orbit TF:

      • TOF switched their code to read the TF length from the GRP (to be checked); with this, all detectors have adapted.
      • Provided new 32-orbit SYNTHETIC datasets for P2, currently Pb-Pb only since the pp simulation fails.

      QC / Monitoring / InfoLogger updates:

      • TPC has opened the first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side; the plan is to extend this to all detectors and to also include trending for raw data sizes.

      CCDB topics.

      AliECS related topics:

      • Improve error messages in the AliECS GUI for EPN-related failures. PDP error messages are sent via ODC in the Run reply, e.g. for topology generation failures, but ECS does not show them, only the generic "EPN Partition Initialize Failed". https://alice.its.cern.ch/jira/browse/OCTRL-734
        • ODC / topology generation error messages are now shown in the ECS GUI, though the GUI is a bit ugly and the text is somewhat convoluted with other content. Vasco is aware, and they will clean it up in the next releases.

      GPU ROCm / compiler topics:

      • Changed the build system to also allow binaries for different CUDA architectures in the same build.
      • The regression in ROCm 5.4.3 is still not understood; reported to AMD, but the current MI100 problem has priority.
      • Found a new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
      • Found a new miscompilation in the looper following with -ffast-math enabled; -ffast-math is disabled for now (a general illustration of the -ffast-math pitfalls follows after this list).
      • Must create a new minimal reproducer for the compile error that appears when we enable the LOG(...) functionality in the HIP code, to check whether this is a bug in our code or in ROCm. Lubos will work on this.
      • Another compiler problem with template handling, found by Ruben. We have a workaround for now; need to create a minimal reproducer and file a bug report.
      • While debugging the calibration, debug output triggered another internal compiler error in the HIP compiler. No problem for now, since it happened only with temporary debug code, but we should still report it to AMD to get it fixed.
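
      As a general illustration of why -ffast-math is delicate in tracking code (separate from the looper-following miscompilation above, which is a genuine compiler bug): -ffast-math implies -ffinite-math-only, so NaN guards may be compiled away. The function below is hypothetical:

        // Well-known -ffast-math pitfall: under -ffinite-math-only, gcc/clang
        // may assume isnan() is always false and drop the guard entirely,
        // letting a NaN propagate silently through the fit.
        #include <cmath>
        #include <cstdio>

        float acceptChi2(float chi2) // hypothetical fit-quality check
        {
          if (std::isnan(chi2)) {
            return 1e30f; // reject degenerate candidate; may be optimized away
          }
          return chi2;
        }

        int main()
        {
          float nan = std::sqrt(-1.0f);
          // Prints 1e+30 normally, but may print nan when built with -ffast-math.
          std::printf("%g\n", acceptChi2(nan));
          return 0;
        }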

      TPC GPU Processing

      • Random GPU crashes under investigation.
      • Investigating GPU tracking with distortion corrections. Something is clearly wrong already during the seeding: single high-pT tracks are sometimes not found, or found only partially.

      ANS Encoding

      • Michael is working on the PR (discussion in JIRA); looks good so far. Hopefully a first version by the end of the week.

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using the FairMQ debug mode) can list all messages currently in the SHM segments, similar to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / improve the parameter range scan for tuning GPU parameters. In particular on the AMD GPUs, which seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
    • 11:20 → 11:25
      TRD Tracking 5m
      Speaker: Ole Schmidt (CERN)
    • 11:25 → 11:30
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
      • Currently: developing road finding (traverse the tree of neighbour cells and create all candidate topologies)
        • Traversing the tree is a recursive function; there is no easy prediction of the number of combinations.
        • Operations are done on indices (cells); the number of candidates is already reduced at this point.
        • Dry run to fill the index table, then rerun with known starting points (a minimal sketch of this two-pass scheme follows below).
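
      A minimal sketch of that dry-run / fill scheme (the generic count-then-populate two-pass pattern with a prefix sum; illustrative only, not the actual ITS tracking code, and the traversal here is a dummy stand-in):

        // Generic two-pass "count, scan, fill" sketch (not the ITS code itself):
        // pass 1 runs the traversal only to count candidates per seed, an
        // exclusive prefix sum turns counts into output offsets, and pass 2
        // reruns the traversal writing into a table of exactly the right size.
        // This sidesteps the unpredictable output size of recursive traversal.
        #include <numeric>
        #include <vector>

        // Hypothetical traversal: emits the candidate roads reachable from `seed`.
        template <class F>
        void traverse(int seed, F&& emit)
        {
          for (int c = 0; c < seed % 3; ++c) { // dummy stand-in for the tree walk
            emit(100 * seed + c);
          }
        }

        int main()
        {
          const int nSeeds = 8;
          std::vector<int> offsets(nSeeds + 1, 0);
          for (int s = 0; s < nSeeds; ++s) { // pass 1: dry run, count only
            traverse(s, [&](int) { ++offsets[s]; });
          }
          std::exclusive_scan(offsets.begin(), offsets.end(), offsets.begin(), 0);
          std::vector<int> roads(offsets[nSeeds]); // exact size known up front
          for (int s = 0; s < nSeeds; ++s) { // pass 2: rerun at known offsets
            int pos = offsets[s];
            traverse(s, [&](int road) { roads[pos++] = road; });
          }
          return 0;
        }

      On the GPU the two passes map naturally onto one thread per seed, with the prefix sum done by a standard scan primitive.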
           
      • PRs pending:
      • CHEP plans:
        • Road finding can be a realistic and reasonable point to arrive at.
        • Work out which plots can reasonably be approved by that time.
        • Overall I see no timing performance regression w.r.t. the CPU (optimisation not yet discussed), nor exaggerated speedups; there are features which can still be valuable (automatic code generation for HIP, intra-TF flexibility, memory size configurability).
           
      • TL;DR: ITS GPU Vertexer (TimeFrame "splitting")
         

      Step               | on GPU (CUDA/HIP) | CPU consistency | Speedup | Optimised | Comments/Improvements
      Tracklet Finder    | ✅                | ✅              | ~       | 🚧        |
      Tracklet Selection | ✅                | ✅              | ~       | 🚧        |
      Cell finder        | /                 | /               | /       |           | Investigate after tracking

      • TL;DR: ITS GPU Tracker (TimeFrame "splitting")
         

      Step                  | on GPU (CUDA/HIP) | CPU consistency | Speedup | Optimised | Comments/Improvements
      Tracklet Finder       | ✅                | ✅ (δ~1‰)       | ~       | 🚧        | multi-ROF sliding window
      Trkl duplicate finder | ✅                | ✅              | ~       | ✅        | thrust & CUB mainly
      Cell finder           | 🚧                |                 | ~       | 🚧        |
      Cell neighbour finder | 🚧                |                 | ~       | 🚧        |
      Road finder           | 🚧                | 🚧              | ~       | 🚧        |
      Track fitting         | 🚧                | 🚧              |         |           |

    • 11:30 → 11:35
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (Heidelberg University (DE))

      - The ideal clusterizer is done. The center of gravity is calculated on the fly for digits which belong to the same cluster. Maxima and centers of gravity are calculated for all digits with the same MC label and within a (time, pad) window of (16, 6), to find single maxima for looper tracks (a minimal sketch follows below).
      - Post-processing for the neural network: remove points where (maximum_pad - CoG_pad + maximum_time - CoG_time) > 2. This is done to not confuse the network (outlier removal), and the cut anyway captures most of the distribution of points.
      - Evaluating the efficiency (total maxima found by the clusterizer / total maxima found by the ideal clusterizer), the fake rate (how many maxima were found by the clusterizer that are not found by the ideal clusterizer) and the clone rate (how many maxima found by the ideal clusterizer have more than one matching maximum from the real clusterizer) is important and ongoing.
      - Maximum finding in 3D might improve results; needs some thought on how to pass values to the network (pad offsets and crossing sectors / ROCs).
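
      A minimal sketch of the on-the-fly CoG and the outlier cut (field names are hypothetical; the real clusterizer additionally groups by MC label and applies the (16, 6) window, and this sketch assumes the pad/time distances in the cut are taken as absolute values):

        // Charge-weighted center of gravity over the digits of one ideal
        // cluster, plus the outlier cut described above. Field names are
        // hypothetical; the |dpad| + |dtime| reading of the cut is an assumption.
        #include <cmath>
        #include <cstdio>
        #include <vector>

        struct Digit { float pad, time, charge; };
        struct CoG { float pad, time; };

        CoG centerOfGravity(const std::vector<Digit>& cluster)
        {
          float q = 0, p = 0, t = 0;
          for (const Digit& d : cluster) { // charge-weighted mean in (pad, time)
            q += d.charge;
            p += d.pad * d.charge;
            t += d.time * d.charge;
          }
          return {p / q, t / q};
        }

        // Outlier removal for the training sample: drop clusters whose maximum
        // lies too far from the CoG, i.e. |dpad| + |dtime| > 2.
        bool keepForTraining(const Digit& maximum, const CoG& cog)
        {
          return std::fabs(maximum.pad - cog.pad) + std::fabs(maximum.time - cog.time) <= 2.0f;
        }

        int main()
        {
          std::vector<Digit> cluster{{10, 100, 5}, {11, 100, 10}, {11, 101, 5}};
          CoG cog = centerOfGravity(cluster);
          std::printf("CoG pad=%.2f time=%.2f keep=%d\n", cog.pad, cog.time,
                      keepForTraining(cluster[1], cog));
          return 0;
        }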