Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 11:00–11:20
      Discussion 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      High priority framework topics:

      • Problem at end of run that produces lots of error messages and breaks many calibration runs: https://alice.its.cern.ch/jira/browse/O2-3315
        • Unfortunately, errors still appear online after fixing those seen in the FST (e.g. partition 2e5XVwXYwvX).
      • Fix START-STOP-START for good
      • Multi-threaded pipeline still works only in the standalone benchmark, not yet in the FST / sync processing.
      • Problem with QC topologies with expendable tasks - calling for a meeting to discuss what requirements we should have for such tasks / what to check at the topology generation level: https://alice.its.cern.ch/jira/browse/QC-953
      • Improve DPL backpressure reporting: https://alice.its.cern.ch/jira/browse/OMON-666
      • Investigated problem with high CPU load of idling processes, fixed 3 causes:
        • DPL metric processing.
        • DPL spinning at the beginning of a run before receiving the first message.
        • DPL spinning on channels that are backpressured; this was the cause of the high load during the async 1NUMA workflow (sketch below).
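
      A minimal sketch of the spinning pattern behind the third cause (illustrative only, not the actual DPL implementation): polling a backpressured channel returns immediately without making progress, so without a back-off the event loop burns a full core doing nothing.

        #include <chrono>
        #include <thread>
        #include <vector>

        // Stand-in for an output channel whose peer applies backpressure, so
        // every send attempt fails immediately.
        struct Channel {
          bool backpressured = true;
          bool trySend() { return !backpressured; }
        };

        // Naive loop: when all channels are backpressured, every poll returns
        // at once without progress, and the loop spins at 100% CPU.
        void spinningLoop(std::vector<Channel>& channels, const bool& running)
        {
          while (running) {
            for (auto& c : channels) {
              c.trySend();  // fails instantly, loop keeps spinning
            }
          }
        }

        // Fixed loop: if a full pass makes no progress, back off before
        // retrying, which drops the idle CPU load to (almost) zero.
        void backingOffLoop(std::vector<Channel>& channels, const bool& running)
        {
          while (running) {
            bool progress = false;
            for (auto& c : channels) {
              progress = c.trySend() || progress;
            }
            if (!progress) {
              std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
          }
        }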

      Other framework tickets:

      • TOF problem with receiving condition in tof-compressor: https://alice.its.cern.ch/jira/browse/O2-3681 - Saw some work on it; is it fixed?
      • After the 64k vertex problem was fixed, there was a problem in the DebugGUI that it got stuck at 100% CPU for large workflows: https://alice.its.cern.ch/jira/browse/O2-3535 - fixed
      • Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to show the pure payload rate: low priority.
      • Backpressure reporting when there is only 1 input channel: no progress.
      • Stop the entire workflow if one process segfaults / exits unexpectedly. Tested again in January; still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
      • https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts; Matthias will also add a check to prevent such failures in the future. In addition, special handling was added for data with incorrect padding: a forward page scan is performed to find all sequences (see the sketch after this list). This slows down the processing but avoids crashes / incorrect decoding, and an alarm is printed to the InfoLogger that the data has bad padding and that the processing is therefore slow.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination if the wrong topology is running. Not critical, since it occurs only at termination and fixing the topology avoids it in any case, but we should still understand and fix the crash itself. A reproducer is available.
      • Support in DPL GUI to send individual START and STOP commands.
      • The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real and will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.
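
      A rough sketch of what the forward page scan for badly padded data could look like (hypothetical, simplified header instead of the real RDH; purely illustrative): every fixed-size page is visited instead of trusting the advertised offsets, which is why the fallback is slow.

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        constexpr std::size_t kPageSize = 8192;  // assumed fixed raw page size

        struct PageHeader {     // simplified, hypothetical stand-in for the real RDH
          std::uint16_t feeId;  // identifies the link / sequence the page belongs to
        };

        // Walk the buffer page by page and collect the offsets of all pages of
        // the given FEE, without relying on (possibly incorrect) padding
        // information. Touching every page is what makes this fallback slow.
        std::vector<std::size_t> forwardPageScan(const std::uint8_t* buf,
                                                 std::size_t size, std::uint16_t feeId)
        {
          std::vector<std::size_t> offsets;
          for (std::size_t off = 0; off + sizeof(PageHeader) <= size; off += kPageSize) {
            const auto* hdr = reinterpret_cast<const PageHeader*>(buf + off);
            if (hdr->feeId == feeId) {
              offsets.push_back(off);
            }
          }
          return offsets;
        }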

      Global calibration topics:

      • TPC calib problem: Need to repeat the test with the latest fixes, but as we still have problems in all online runs, I think it will require more work.
      • Switched TPC IDC FLP-to-EPN sending to 32 orbits, should be the last change on the software side.
      • Disabled per-TF logging of calib tasks, since it creates far too large log files (up to 100 GB per run per calib node).
        • Will implement automatic downscaling of DPL process reporting as well.
        • If detectors need logging, they must implement proper downscaling (a minimal sketch follows this list). Will announce this to them by e-mail.
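
      A minimal sketch of the kind of downscaling meant here (hypothetical helper, not an existing O2 API): log the first TFs of a run unconditionally, then only every Nth.

        #include <cstdint>
        #include <iostream>
        #include <string>

        // Hypothetical helper: emit a per-TF log line only for the first TFs
        // of a run and then for every Nth TF, so log files stay bounded even
        // in very long runs.
        void logPerTF(std::uint64_t tfCounter, const std::string& msg)
        {
          constexpr std::uint64_t kLogFirst = 10;    // always log the first 10 TFs
          constexpr std::uint64_t kLogEvery = 1000;  // afterwards only every 1000th TF
          if (tfCounter < kLogFirst || tfCounter % kLogEvery == 0) {
            std::cout << "TF " << tfCounter << ": " << msg << '\n';
          }
        }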

      Async reconstruction

      • Investigation of the remaining oscillations:
        • GPU reconstruction is sometimes slow. Not fully clear what happens, but the GPUs sometimes seem stuck / slow for a while when the server is under high load, e.g. DMA transfer rates drop from 30 GB/s to 30 MB/s, and a single transfer can then last 20 s. Not clear what triggers this, and there is no manpower to look into it at the moment.
        • Other processes also sometimes need excessive time:
          • ITS tracking. - fixed by Matteo
          • Secondary vertexing (up to 200s), seems to be a side effect of excessive ITS seeds. - Fixed by Matteo
          • EMCAL QC. - fixed
        • The problem is that such slow processes cause TFs to queue up at their input, so fewer TFs are in flight, leading to oscillations in the processing rate.
      • Improvements / performance impact of the oscillations in the 1NUMA workflow:
        • Rate smoothing mitigates the problem and improves the overall throughput by 3.4%.
        • The 1NUMA workflow with TF throttling and smoothing now reaches 82% CPU load (due to HyperThreading, 50% CPU load corresponds to 80% of the capacity, so with 82% load we are at ~92% of the max compute capacity; see the worked numbers after this list), at 1.9 s per TF.
        • For comparison, the 1NUMA workflow with a manually tuned publishing rate, but without rate throttling, reaches 95% CPU load and 1.75 s per TF, i.e. it is still 8.5% faster.
          • Should still try to understand and improve this.
      • Updated async reco performance benchmarks after CPU load fix:
        • 8-core CPU workflow: 4.81s per TF (cannot run in practice, goes OOM very quickly)
        • 16-core CPU workflow: 4.27s per TF (stable)
        • 1GPU workflow: 1.83s (unstable, can go OOM; will get better with shorter TFs)
        • 1NUMA workflow: 1.75s (stable)
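
      One way to reproduce the quoted capacity numbers (an assumed linear model; the notes do not state how the ~92% was computed): take 50% CPU load, i.e. all physical cores busy, as 80% of the achievable throughput, and interpolate the remaining HyperThreading gain linearly above that:

        f(\ell) =
        \begin{cases}
          1.6\,\ell, & \ell \le 0.5 \\
          0.8 + 0.4\,(\ell - 0.5), & \ell > 0.5
        \end{cases}
        \qquad
        f(0.82) = 0.8 + 0.4 \cdot 0.32 \approx 0.93

      i.e. within a percent of the quoted ~92% of the maximum compute capacity.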

      EPN major topics:

      • Switched CTF size to 10 GB. Async reco will use direct streaming from alien instead of copying to local files.
        • The SSDs should survive this during Run 3; afterwards we might need to replace them.
        • Deploying the RAMDISK cache is thus no longer needed.
      • Fast movement of nodes between async / online without EPN expert intervention.
      • Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
      • Improve DataDistribution file replay performance: currently it cannot go faster than 0.8 Hz, so we cannot test the MI100 EPNs in Pb-Pb at nominal rate, and cannot test the pp workflow for 100 EPNs in the FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
      • Need DDS/ODC feature to deploy different topologies on EPNs with MI50 and with MI100. ETA End of March + ~2 weeks - Status?
      • Go to error state if a critical task (e.g. a calib task) fails (taking nmin into account). Currently we do not see failing calibration tasks at all, except for a message in the InfoLogger; ODC should go to error, and ECS should then stop the run automatically, also when n < nmin. ETA end of March - Status?
      • Lubos to provide an interface to query the current EPN SHM settings.

      Other EPN topics:

      EPN farm upgrade:

      • Repeated the test with repeated DMA transfers for the second test case; it fails in the very same way.
      • The problem disappears when amdgpu function calls are disabled, but this comes with a performance penalty.
      • Will have a meeting with AMD on Friday to discuss how to proceed.

      Full system test issues:

      • Long-running full system test (> 6 hours) seems to die due to running out of SHM; could be an SHM leak again, need to check.

      Topology generation:

      • Ole is investigating the use of set -u or set -e in the topology generation scripts to catch more errors; both have drawbacks (-e aborts the script on any command returning a non-zero status, including cases where that is expected, while -u only flags the expansion of unset variables). Current plan is to use -u, but not -e. To be merged when Ole is back from vacation in 2 weeks. - Status?
      • Should test deploying the topology with the DPL driver, to have the remote GUI available.

      Software deployment at P2:

      • O2PDPSuite updated on Tuesday, went smoothly.
      • RC and some detectors requested a new SYNTHETIC run dataset; discussed with Mezut how to get a more realistic number of TPC clusters. Simulating the afterglow will be very slow and needs lots of memory, but can be done on the EPNs. Hope to create the new dataset next week.

      QC / Monitoring / InfoLogger updates:

      • TPC has opened a first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side; the plan is to extend this to all detectors and to also include trending for raw data sizes.
      • New logFetcher tool from Ole to gather the log files of a run without logging in to individual EPNs.

      CCDB topics:

      AliECS related topics:

      • ECS should replace newlines in extra env field by spaces, but doesn't. Filed a bug report in JIRA.
      • In preparation for supporting configurable SHM sizes, added SHM size options to ECS EPN workflow configuration. Will be passed to topology generation. Then to be checked against actual EPN SHM size settings.

      GPU ROCm / compiler topics:

      • Changed the build system to also allow binaries for different CUDA architectures in the same build.
      • Regression in ROCm 5.4.3 still not understood, reported to AMD, but current MI100 problem has priority.
      • Found a new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
      • Found a new miscompilation with -ffast-math enabled in the looper following; -ffast-math is disabled for now.
      • Must create a new minimal reproducer for the compile error when we enable the LOG(...) functionality in the HIP code, to check whether this is a bug in our code or in ROCm. Lubos will work on this.
      • Ruben found another compiler problem with template handling. We have a workaround for now; need to create a minimal reproducer and file a bug report.
      • While debugging the calibration, debug output triggered another internal compiler error in the HIP compiler. No problem for now, since it happened only with temporary debug code, but we should still report it to AMD to get it fixed.

      TPC GPU Processing

      • Random GPU crashes under investigation.
      • Sergey provided a fix for the inverse transformation map, but it is not yet merged in O2. Alex is also checking. Will continue the checks of TPC tracking with distortions when I find time.
      • Bug in TPC QC with MC embedding: TPC QC does not respect the sourceID of the MC labels, so it confuses tracks of signal and of background events (see the sketch after this list).
      • Bug in TPC Clusterizer in multi-threaded CPU DLBZS decoding, Felix is investigating.
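
      A minimal, self-contained illustration of the label bug (hypothetical types, not the actual O2 QC code): if labels are compared by (event, track) only, a signal track and an embedded background track with the same IDs collide; the sourceID must be part of the comparison.

        #include <cstdint>
        #include <tuple>

        struct MCLabel {
          std::int32_t trackID;
          std::int32_t eventID;
          std::int32_t sourceID;  // distinguishes signal from embedded background
        };

        // Buggy: treats labels from different sources as identical.
        bool sameBuggy(const MCLabel& a, const MCLabel& b)
        {
          return std::tie(a.trackID, a.eventID) == std::tie(b.trackID, b.eventID);
        }

        // Correct: include the sourceID in the comparison.
        bool sameFixed(const MCLabel& a, const MCLabel& b)
        {
          return std::tie(a.trackID, a.eventID, a.sourceID) ==
                 std::tie(b.trackID, b.eventID, b.sourceID);
        }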

      ANS Encoding

      • PR open, currently fails to compile in the Mac CI due to different definitions of some types.
      • Need to clarify how to detect SSE / AVX features to automatically enable vectorized processing. The idea is to use the CMake Vc checks, which we already have in O2 anyway and which check the compile flags (see the sketch below).
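
      A hedged sketch of the compile-time side (not the actual O2/ANS code): the compile flags that the CMake Vc checks test for are also visible to the compiler as predefined macros, so the vectorized path can be selected at compile time.

        #include <cstdio>

        // The predefined macros below are set by GCC/Clang when the
        // corresponding flags (e.g. -mavx2, -msse4.2) are active.
        void encode()
        {
        #if defined(__AVX2__)
          std::puts("using AVX2 vectorized path");
        #elif defined(__SSE4_2__)
          std::puts("using SSE4.2 vectorized path");
        #else
          std::puts("using scalar fallback path");
        #endif
        }

        int main() { encode(); }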

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / improve the parameter range scan for tuning GPU parameters. In particular on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate the training and test data sets.
    • 11:20–11:25
      TRD Tracking 5m
      Speaker: Ole Schmidt (CERN)
    • 11:25–11:30
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (Heidelberg University (DE))
    • 11:30–11:35
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
      • Updates: getting some numbers for the CHEP talk
        • Comparison of CPU vs. GPU for what is ported so far (full vertexer, tracker up to cell finding)
          • AMD 7950X / Titan X
          • EPYC / MI50
        • Elapsed-time scaling vs. available memory
          • Both the Titan X and the MI50, up to their memory size
        • Throughput would be more meaningful
          • Not fair until final output validation (for the tracker)
          • More complex to evaluate; not realistic on the current timescale