Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Videoconference: ALICE GPU Meeting (Zoom Meeting ID 61230224927, host: David Rohr)
    • 11:00–11:20
      PDP Run Coordination 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Reducing overhead from headers:

      • Matthias is working on this: https://alice.its.cern.ch/jira/browse/O2-2395.
      • Had to revert one PR, since it broke the FST with DD.

      Issue with start / stop / start cycle.

      • DPL processes die if ECS does a start / stop / start cycle. Currently this means a new partition must be created for every run: https://alice.its.cern.ch/jira/browse/O2-2559
        • Updated monitoring to the latest version; the problem remains. A debug session is scheduled for Thursday morning of the milestone week with Adam, Giulio, and David.

      Proper EOS handling in workflows (current plan):

      • Summarized what we want to do in this JIRA: https://alice.its.cern.ch/jira/browse/O2-2715?filter=-2
      • Giulio will work on it once the more important issues are solved.

      Problem with EPN OS and GPUs:

      • AMD GPUs are currently not working on CS8; we are investigating. For the moment the EPNs must stay on CC8.
      • ROCm 4.5 fails to register memory when we switch to CentOS 8.5 (8.5 is the final CC8 version, since CC8 is now EOL; if we want to stick with CC8 on the EPNs for some time, it would perhaps make sense to install this latest version now).
      • Inquired with AMD about future support of RHEL clones, waiting for reply.
      • The EPN team will set up a test script for future validation of releases; they will validate ROCm 4.5 and then install it on the nodes (probably only after the MW). Another update will come once AMD has fixed the issue with CC 8.5. No clear plan yet whether to switch to CS8 or to another RedHat clone.

      New defaults for EPN:

      • EPNs should not use the dataflow defaults anymore, since they create OpenSSL problems with AliEn access.
      • We want to use -O3 / -march for the EPN builds, thus we have created `defaults-o2-epn.sh`, a clone of `defaults-o2.sh` with different CFLAGS. To be tested (on my todo list); then we switch the nightly EPN builds to these defaults. A sketch of such a defaults file is shown below.
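
      A minimal sketch of what such a defaults file could look like, assuming the usual alidist defaults format; the concrete flags (in particular the -march target) are illustrative, not the final values:

        package: defaults-o2-epn
        version: v1
        env:
          # Illustrative flags only: enable -O3 and CPU tuning for the EPNs;
          # everything else would stay identical to defaults-o2.sh.
          CFLAGS: "-fPIC -O3 -march=znver2"
          CXXFLAGS: "-fPIC -O3 -march=znver2"
        ---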

      GPU issues:

      • Check if ROCm 4.5 fixed the server crashes we had when enabling GPU monitoring: the EPN team will check once they have established a validation infrastructure.
      • Check if ROCm 4.5 fixes the issue that GPU clock speeds are sometimes set too low. The EPN team will check once the farm is fully transitioned to ROCm 4.5.
      • Create a new minimal reproducer for the compile error we see when enabling the LOG(...) functionality in the HIP code, and check whether this is a bug in our code or in ROCm. Lubos will work on this; a hypothetical shape of such a reproducer is sketched below.
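
      A hypothetical shape for such a reproducer, assuming the problem involves a stream-style logging macro in a translation unit compiled by hipcc (LOG_STANDIN and the kernel are invented stand-ins, not the real O2 code):

        #include <hip/hip_runtime.h>
        #include <iostream>

        // Invented stand-in for the real LOG(...) macro, just to mimic its shape.
        #define LOG_STANDIN(severity) std::cout << #severity << ": "

        __global__ void dummyKernel(int* out) { *out = 42; }

        int main()
        {
          int* d = nullptr;
          hipMalloc(&d, sizeof(int));
          hipLaunchKernelGGL(dummyKernel, dim3(1), dim3(1), 0, 0, d);
          hipDeviceSynchronize();
          // In the real code, enabling the LOG(...) path in a HIP compilation
          // unit is what triggers the compile error to be reproduced.
          LOG_STANDIN(info) << "kernel done" << std::endl;
          hipFree(d);
          return 0;
        }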

      Full system test issues:

      • Crashes on the EPN dev node (while working on a normal EPN node and on my laptop) when loading the CTF dictionary.

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using the FairMQ debug mode) can list all messages currently in the SHM segments, similar to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.

      InfoLogger messages:

      • Still need to switch the default threshold from warning to important.
        • All detector errors that do not require action by operations should be demoted from error to alarm.
        • Important warnings that we want to see in the InfoLogger should be promoted to alarm. An illustration of this demotion / promotion follows after this list.
        • The severity limit for the InfoLogger will be raised to state, so that warnings are no longer shown.
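
      As an illustration of the intended demotion / promotion, assuming the FairLogger-style LOG macros used in O2 (the messages themselves are made up):

        #include <fairlogger/Logger.h>  // provides the LOG(...) macro used in O2

        int main()
        {
          // Detector error that needs no action by operations: demote error -> alarm.
          LOG(error) << "EMC: channel out of calibration range";  // before
          LOG(alarm) << "EMC: channel out of calibration range";  // after

          // Important warning we still want in the InfoLogger: promote warning -> alarm.
          LOG(warning) << "TPC: falling back to CPU tracking";    // before
          LOG(alarm) << "TPC: falling back to CPU tracking";      // after
          return 0;
        }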

      Time frame throttling:

      • Not working together with QC (thus unusable for sync reco).
      • The throttling mechanism has already been extracted into a dedicated class, to be used in different places; a generic sketch of the idea follows after this list.
      • In order to use it for cases other than the raw-proxy, the ad-hoc setup of the out-of-band FMQ channel must be replaced by a generic out-of-band-channel DPL feature.
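
      Not the actual DPL class, but a generic sketch of the throttling idea (all names invented): a producer may only inject a new TF while the number in flight is below a threshold, and frees a slot when a TF leaves the chain.

        #include <atomic>
        #include <cstddef>

        class TimeframeThrottler
        {
         public:
          explicit TimeframeThrottler(std::size_t maxInFlight) : mMaxInFlight(maxInFlight) {}

          // Try to reserve a slot for a new time frame; false means "throttle".
          bool mayInject()
          {
            std::size_t current = mInFlight.load();
            while (current < mMaxInFlight) {
              if (mInFlight.compare_exchange_weak(current, current + 1)) {
                return true;
              }
            }
            return false;
          }

          // Called when a time frame has left the processing chain.
          void release() { mInFlight.fetch_sub(1); }

         private:
          const std::size_t mMaxInFlight;
          std::atomic<std::size_t> mInFlight{0};
        };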

      Speeding up startup of workflow on the EPN:

      • Improved and parallel GPU memory registration is fully working with the new FairMQ / ROCm 4.5. Waiting for Monday to deploy the new software on the EPNs; then we have to update O2 once more. This will reduce the time for GPU registration (not including allocation / locking) from 90 s to ~5 s; a sketch of the parallel-registration idea is shown after this list.
      • SHM management tool.

        • First PR by Alexey is available, to be tested by me; looks good so far.

      • New feature: we must not create the DPL workflow JSON in each individual process (currently we start O(1000) processes at each start of run just to create the JSONs). Discussing with the DDS team here: https://alice.its.cern.ch/jira/browse/R3C-696. Anar is implementing a config-file feature in DDS: https://github.com/FairRootGroup/DDS/issues/406
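
      A minimal sketch of the parallel-registration idea referenced above (illustrative only, not the actual O2 implementation): split the large, already-allocated SHM block into chunks and let several threads call hipHostRegister() concurrently.

        #include <hip/hip_runtime.h>
        #include <cstddef>
        #include <thread>
        #include <vector>

        // Register one large SHM block with the GPU in parallel chunks.
        // Chunk boundaries should be page-aligned in practice; concurrent
        // hipHostRegister calls require a ROCm version where this is safe
        // (reportedly the case with 4.5).
        void registerShmParallel(void* base, std::size_t size, unsigned nThreads)
        {
          std::size_t chunk = size / nThreads;
          std::vector<std::thread> workers;
          for (unsigned i = 0; i < nThreads; ++i) {
            char* ptr = static_cast<char*>(base) + i * chunk;
            std::size_t len = (i == nThreads - 1) ? size - i * chunk : chunk;
            workers.emplace_back([ptr, len] {
              hipHostRegister(ptr, len, hipHostRegisterDefault);
            });
          }
          for (auto& w : workers) {
            w.join();
          }
        }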

      EPN Scheduler:

      • Presented at RC meeting yesterday, no complaints yet.
      • Had a quite severe problem: user jobs can leak SHM segments.
        • There is no easy way to clean them up via an epilogue script, since it is not clear which segment belongs to which job. It is clear to which user a segment belongs, but a user might have 2 jobs running on the same node.
        • Discussed with the FairMQ team.
        • Finally implemented a solution based on the Slurm job_container plugin, which seems to work perfectly now; a configuration sketch follows below.
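
      For reference, a sketch of the kind of configuration this relies on, assuming the job_container/tmpfs plugin (paths are placeholders): each job gets a private /dev/shm namespace, so leaked segments disappear automatically when the job ends, without an epilogue script having to guess ownership.

        # slurm.conf (sketch)
        JobContainerType=job_container/tmpfs

        # job_container.conf (sketch)
        AutoBasePath=true
        BasePath=/var/spool/slurm/containers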

      Important framework features to follow up:

      • Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Giulio will follow this up, but currently some other things still have higher priority.
      • Bug: the ROOT writer stores TFs to out-of-sync entries in the ROOT file if DPL pipelines are used.
        • Fix via a completion policy in https://github.com/AliceO2Group/AliceO2/pull/7536, but it needs additional framework support.
      • Suppress default options when generating DDS command lines: https://alice.its.cern.ch/jira/browse/O2-2736 - will drop this, not needed once we cache the DPL JSON topology.
      • Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
      • Backpressure reporting when there is only 1 input channel: no progress.
      • (Not for this week) multi-threaded pipeline: no progress.
      • (Not for this week) Problem with forwarding of multi-spec output: the MID entropy encoder receives the TF twice: no progress.
      • Stop the entire workflow if one process segfaults / exits unexpectedly. Tested again in January; still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
      • Output proxy spinning at 100%: no progress.

      AMD / GPU stability issues in FST:

      • ROCm 4.5 released. Result of tests of open issues:
        • Random server crashes with an error reported by the amdgpu kernel module in dmesg: not fully fixed; we had at least one crash now, but with a different dmesg message than before. Not clear whether it is the same issue or something different (or a hardware error; it happened only once, but we have only one test node so far).
        • Random crash with noisy pp data: disappeared with ROCm 4.5; we cannot reproduce it anymore. It was never understood; hopefully it was a bug in the previous ROCm that was fixed by chance in 4.5. Closed now.
        • Random crash processing Pb-Pb data: still there, but it happens similarly as with ROCm 4.3, thus no regression in 4.5. Need to debug further what exactly happens and then report to AMD.
        • Error with memory registration: fixed.
        • Errors with logging / monitoring / GPU clocks: currently being followed up by EPN team.

      GPU Performance issues in FST:

      • One infrequent performance issue remains: single iterations on AMD GPUs can take significantly longer; we have seen up to 26 seconds instead of 18 seconds. Under investigation, but there is no large global effect.

      Minor open points:

      • https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination when the wrong topology is running. Not critical, since it happens only at termination, and the fix of the topology avoids it in any case, but we should still understand and fix the crash itself. A reproducer is available.

      Workflow repository:

      • Detectors would like the QC JSON files to be fetched from Consul. Discussed with Barth: originally we wanted this to be versioned, but versioning will still take a while. We will add version information to AliECS / O2DPG, but it will not be used for fetching, i.e. after updating the QC JSONs, the version must be incremented to invalidate the workflow cache.
      • Need a better way to specify the O2PDPSuite / QC JSON version. My proposal was to have extra expert fields in AliECS, similar to those for the workflow repository. Alternatively, we could use the CCDB. Will be discussed next Wednesday in the RCWM.

      Detector status:

      • EMCAL errors (no crash, just messages, EMCAL is working on it).
    • 11:20–11:40
      Software for Hardware Accelerators 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      ITS GPU Tracking and Vertexing:

      • Matteo will now continue with ITS GPU tracking.

      TRD Tracking:

      • Working on the strict matching mode (filtering out ambiguous matches to obtain a very clean track sample).
      • Will have a joint session with Ole and David to progress with the TRD GPU tracking next week.

      ANS Encoding:

      • Michael's PR with the new renorming has been merged; no issues so far.