Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 11:00 → 11:20
      PDP Run Coordination 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Reducing overhead from headers:

      • Matthias is working on this: https://alice.its.cern.ch/jira/browse/O2-2395.
      • Regression fixed, PR merged.
      • Next step is to integrate it in DataDistribution?

      Issue with start / stop / start cycle:

      • DPL processes die if ECS does a cycle start / stop / start. Currently this means a new partition must be created for every run: https://alice.its.cern.ch/jira/browse/O2-2559
        • Joint debug session held with Giulio, Matthias, Adam, David, and Federico.
        • Issue reproduced and fixed, but there is another issue: after the second start the readout-proxy doesn't receive data, i.e. 2 runs in the same partition are still not working.
        • Giulio has added restart functionality to the DebugGUI, so this can be reproduced and investigated locally.

      Proper EOS handling in workflows (current plan):

      • Summarized what we want to do in this JIRA: https://alice.its.cern.ch/jira/browse/O2-2715?filter=-2
      • Giulio will work on it once the more important issues are solved.

      GPU ROCm Issues to be followed up by EPN:

      • EPN team will set up a test script for future validation of ROCm releases.
      • Check if ROCm 4.5 fixed the server crashes we had when enabling GPU monitoring.
      • Check if ROCm 4.5 fixes the issue that GPU clock speeds are sometimes set too low.
      • Create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
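
      For reference, a minimal reproducer for this class of problem might have the following shape: a single translation unit compiled by hipcc that combines a trivial kernel with host-side FairLogger LOG(...) calls. This is a hedged sketch, not the actual failing code; with a working toolchain it compiles and runs, and the reported issue would manifest as a compile error on such a file.

      ```cpp
      // Hypothetical minimal reproducer sketch: host-side FairLogger LOG(...)
      // next to HIP device code in one hipcc-compiled translation unit.
      #include <hip/hip_runtime.h>
      #include <fairlogger/Logger.h> // provides the LOG(severity) streaming macro

      // Trivial kernel, present only so that hipcc emits device code.
      __global__ void dummyKernel(int* out)
      {
        out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x;
      }

      int main()
      {
        int* buf = nullptr;
        if (hipMalloc(&buf, 256 * sizeof(int)) != hipSuccess) {
          LOG(error) << "hipMalloc failed";
          return 1;
        }
        hipLaunchKernelGGL(dummyKernel, dim3(1), dim3(256), 0, 0, buf);
        if (hipDeviceSynchronize() != hipSuccess) {
          LOG(error) << "kernel execution failed";
          return 1;
        }
        LOG(info) << "kernel ran successfully";
        hipFree(buf);
        return 0;
      }
      ```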

      Problem with EPN OS and GPUs:

      • AMD GPUs are currently not working on CS8; investigating. For the moment the EPNs must stay on CC8.
      • ROCm 4.5 fails to register memory when we switch to CentOS 8.5 (8.5 is the final CC8 version, since CC8 is now EOL; if we want to stick to CC8 on the EPNs for some time, it would perhaps make sense to install this latest version now).
      • Inquired with AMD about future support of RHEL clones, waiting for reply.

      New defaults for EPN:

      • EPNs should not use the dataflow defaults anymore, since they create OpenSSL problems with AliEn access.
      • New default set up and tested locally in build container.
      • Working so far, but we need to test on the CI infrastructure. A problem is that the build executes part of the code, so cross-compilation is not easily possible. The `march` setting for Ryzen CPUs could make the build fail on the builders (tested locally with a Kaby Lake laptop, which worked fine). To be tested; see the sketch after this list.
      • Not yet active, since there were many updates anyway, and we didn't want to add more complexity. But should be ready to deploy soon.
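
      Since the build executes part of the code it compiles, a binary built with a Ryzen `march` setting (e.g. -march=znver1) can abort with an illegal instruction on an older builder CPU. A minimal sketch of a runtime guard for this, purely illustrative and not part of the actual build setup:

      ```cpp
      // Hypothetical sketch: check at runtime that the CPU supports extensions
      // implied by the Ryzen -march setting before executing natively-built code.
      #include <cstdio>
      #include <cstdlib>

      int main()
      {
        __builtin_cpu_init(); // initialize CPU feature detection (GCC/clang builtin)
        // AVX2 and FMA are among the extensions enabled by -march=znver1 and later.
        if (!__builtin_cpu_supports("avx2") || !__builtin_cpu_supports("fma")) {
          std::fprintf(stderr, "CPU lacks AVX2/FMA; skipping natively-built step\n");
          return EXIT_FAILURE; // lets the build fail gracefully instead of SIGILL
        }
        std::puts("CPU supports the required extensions");
        return EXIT_SUCCESS;
      }
      ```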

      Full system test issues:

      • Crashes on the EPN dev node when loading the CTF dictionary (while it works on a normal EPN node and on my laptop). Fixed by Michael; it was a bug in the ANS encoder (actually 2 bugs were fixed, another one was spotted during the process).
      • Detectors not writing raw files with correct links. Cannot run full scale full system test since files are incompatible with Readout. The 2 missing detectors are MFT and PHOS. PHOS opened a PR to fix this today.

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using the FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108 (see the sketch after this list).
      • Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
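
      Such a tool could, for instance, open the FairMQ-managed segment with boost::interprocess and walk its named objects. The sketch below is only a hedged starting point: the segment name is a placeholder (FairMQ derives the real name from the session ID), and mapping allocations back to individual messages would additionally require the metadata written in FairMQ's debug mode.

      ```cpp
      // Hypothetical SHM inspection sketch using boost::interprocess.
      #include <boost/interprocess/managed_shared_memory.hpp>
      #include <iostream>

      int main(int argc, char** argv)
      {
        // Placeholder segment name; pass the real FairMQ segment name as argv[1].
        const char* name = argc > 1 ? argv[1] : "fmq_shm_segment";
        try {
          boost::interprocess::managed_shared_memory segment(boost::interprocess::open_read_only, name);
          std::cout << "segment " << name << ": size=" << segment.get_size()
                    << " free=" << segment.get_free_memory() << "\n";
          // Walk all named objects allocated inside the managed segment.
          for (auto it = segment.named_begin(); it != segment.named_end(); ++it) {
            std::cout << "  named object: " << it->name() << "\n";
          }
          std::cout << segment.get_num_named_objects() << " named objects, "
                    << segment.get_num_unique_objects() << " unique objects\n";
        } catch (boost::interprocess::interprocess_exception& e) {
          std::cerr << "cannot open segment '" << name << "': " << e.what() << "\n";
          return 1;
        }
        return 0;
      }
      ```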

      InfoLogger messages:

      • Still need to switch default threshold from warning to important.
        • All detector errors that do not need action by the operations team should be demoted from error to alarm.
        • Important warnings that we want to see in the InfoLogger should be promoted to alarm.
        • The severity limit for the InfoLogger will then be raised, so that warnings are no longer shown.

      Time frame throttling:

      • Not working together with QC (thus unusable for sync reco).
      • Throttling mechanism already extracted to dedicated class, to be used in different places.
      • In order to use it for cases other than the raw-proxy, the ad-hoc setup of the out-of-band FMQ channel must be replaced by a generic out-of-band channel DPL feature.
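
      As an illustration of the kind of reusable throttling class meant here, a minimal sketch follows; the name and interface are assumptions for illustration, not the actual O2 class.

      ```cpp
      // Hypothetical sketch of a reusable rate throttler: accept at most one
      // item per configured minimum interval, drop the rest.
      #include <chrono>

      class Throttler
      {
       public:
        explicit Throttler(std::chrono::milliseconds minInterval) : mMinInterval(minInterval) {}

        // Returns true if enough time has passed since the last accepted item;
        // the caller drops (or delays) the item otherwise.
        bool accept()
        {
          auto now = std::chrono::steady_clock::now();
          if (now - mLast < mMinInterval) {
            return false; // too soon: throttle this time frame
          }
          mLast = now;
          return true;
        }

       private:
        std::chrono::milliseconds mMinInterval;
        std::chrono::steady_clock::time_point mLast{};
      };
      ```

      A raw-proxy-like device would then call accept() once per time frame and forward only the accepted ones.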

      Speeding up startup of workflow on the EPN:

      • SHM management tool fully integrated by Alexey, DD support added by Gvozden, NUMA awareness added by David.
      • Conducted standalone measurements of the PDP processing workflow startup time: 18 seconds for the full Pb-Pb reconstruction workflow in the 8-GPU setup on an EPN node.
      • This is the reconstruction workflow only, i.e. it contains everything except QC and calibration (though adding QC and calibration should increase the time only marginally, if at all).
      • Tested with dev branches of O2 and DD with some custom hacks, i.e. currently not reproducible online (particularly, manually made the features requested in R3C-646 and R3C-696 work).
        • One problem in DPL (cannot parse empty string arguments) currently prevents merging the changes in O2. https://alice.its.cern.ch/jira/browse/O2-2757
      • Measurement assumed the same workflow was run before at least once, so that the JSONs / XMLs are cached.
      • This includes the whole startup through all states, i.e. from processes not running --> INITIALIZED --> READY --> RUNNING (i.e. both the Configure and the Start in AliECS).
      • Measured on a single node, but all EPNs start in parallel.
      • From the PDP side, this is more or less as good as it gets. Perhaps we can cut 2 or 3 seconds more, but we should not expect significant further improvements.
      • Next steps:
        • Get R3C-646 and R3C-696 closed.
        • Remeasure in a real partition controlled by AliECS, compare the time, and see where we have delays and lose time because things don't run in parallel.
      • New feature: we must avoid creating the DPL workflow JSON in each individual process (currently this means we start O(1000) processes at each start of run just to create the JSONs). Discussing with the DDS team here: https://alice.its.cern.ch/jira/browse/R3C-696. Anar is implementing a config file feature in DDS: https://github.com/FairRootGroup/DDS/issues/406
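
      The caching assumed in the measurement above could, in principle, look like the following sketch: key the generated topology JSON by a hash of the command line that defines the workflow, and regenerate only on a cache miss. Paths, names, and the hashing scheme are made up for illustration.

      ```cpp
      // Hypothetical sketch: cache the generated DPL topology JSON so that
      // repeated starts of the same workflow reuse the file instead of
      // regenerating it (the expensive O(1000)-process step).
      #include <filesystem>
      #include <fstream>
      #include <functional>
      #include <string>

      std::filesystem::path cachedTopology(const std::string& workflowCmd,
                                           const std::function<std::string()>& generateJson)
      {
        namespace fs = std::filesystem;
        fs::path dir = fs::temp_directory_path() / "dpl-topology-cache"; // placeholder location
        fs::create_directories(dir);
        // Key the cache entry on the full command line that defines the workflow.
        fs::path json = dir / (std::to_string(std::hash<std::string>{}(workflowCmd)) + ".json");
        if (!fs::exists(json)) {
          std::ofstream out(json);
          out << generateJson(); // cache miss: generate once and store
        }
        return json; // subsequent starts reuse the cached file
      }
      ```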

      EPN Scheduler

      • Had a quite severe problem: user jobs might leak SHM segments.
        • No easy way to clean them via epilogue script, since it is not clear which segment belongs to which job. It is clear to which user, but a user might have 2 jobs running on the same node.
        • Discussed with FairMQ team.
        • Finally implemented solution based on slurm job_container, which seems to work perfectly now.
      • Next step: add the VOBox for the GRID.

      Important framework features to follow up:

      • Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Giulio will follow this up, but currently some other things still have higher priority.
      • Bug: the ROOT writer stores TFs in out-of-sync entries in the ROOT file if DPL pipelines are used.
        • Fix via a completion policy in https://github.com/AliceO2Group/AliceO2/pull/7536, but it needs additional framework support (see the sketch after this list).
      • Suppress default options when generating DDS command lines: https://alice.its.cern.ch/jira/browse/O2-2736 - will drop this, not needed once we cache the DPL JSON topology.
      • Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
      • Backpressure reporting when there is only 1 input channel: no progress.
      • (Not for this week) multi-threaded pipeline: no progress.
      • (Not for this week) Problem with forwarding of multi-spec output: the MID entropy encoder receives the TF twice: no progress.
      • Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
      • Output proxy spinning at 100%: no progress.
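
      For context, a custom completion policy is attached to a DPL workflow roughly as sketched below, assuming the standard customize() hook; the device-name matcher and policy choice are illustrative, not the actual fix from the PR.

      ```cpp
      // Hedged sketch of customizing a DPL completion policy for writer devices.
      #include "Framework/CompletionPolicyHelpers.h"
      #include "Framework/DeviceSpec.h"
      #include <vector>

      void customize(std::vector<o2::framework::CompletionPolicy>& policies)
      {
        using namespace o2::framework;
        // Let the writer consume its inputs only once a complete input set has
        // arrived, so that entries stay in sync across branches.
        policies.push_back(CompletionPolicyHelpers::consumeWhenAll(
          "writer-consume-all", [](DeviceSpec const& spec) {
            return spec.name.find("writer") != std::string::npos;
          }));
      }

      #include "Framework/runDataProcessing.h" // must be included after customize()

      o2::framework::WorkflowSpec defineDataProcessing(o2::framework::ConfigContext const&)
      {
        return {}; // empty placeholder workflow; a real one declares its DataProcessorSpecs
      }
      ```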

      AMD / GPU stability issues in FST:

      • ROCm 4.5 released. Result of tests of open issues:
        • Random server crashes with an error reported by the amdgpu kernel module in dmesg: not fully fixed; we had at least one crash now, but with a different dmesg message than before. Not clear whether it is the same issue or something different (or a hardware error; it happened only once, but we have only one test node so far).
        • Random crash with noisy pp data: Disappeared with ROCm 4.5. Cannot reproduce it anymore. Was never understood. Hopefully it was a bug in previous ROCm that was fixed by chance in 4.5. Closed now.
        • Random crash processing Pb-Pb data: still there, but happens similarly as in ROCm 4.3, thus no regression in 4.5. Need to debug further what exactly happens and then report to AMD.
        • Error with memory registration: fixed.

      GPU Performance issues in FST

      • One infrequent performance issue remains: single iterations on AMD GPUs can take significantly longer; we have seen up to 26 seconds instead of 18 seconds. Under investigation, but no large global effect.

      Minor open points:

      • https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it occurs only at termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.

      Workflow repository:

      • Detectors would like the QC JSON files to be fetched from Consul. Discussed with Barth. Originally we wanted this to be versioned, but versioning will still take a while. We will add version information to AliECS / O2DPG, but this will not be used for fetching, i.e. after updating QC JSONs, the version must be incremented to invalidate the workflow cache.
      • Need a better way to specify the O2PDPSuite / QC JSON version. My proposal was to have extra expert fields in AliECS, similar to those for the workflow repository. Alternatively, we could use the CCDB. Will be discussed next Wednesday in the RCWM.
        • Discussed at RC Weekly Meeting. We will have separate settings for O2DPG / O2PDPSuite / QC JSONs in the AliECS EPN/PDP expert panel. To be implemented by David and FLP team. ETA ~1 week.

      Detector status:

      • EMCAL errors (no crash, just messages, EMCAL is working on it).
    • 11:20 → 11:40
      Software for Hardware Accelerators 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      ITS GPU Tracking and Vertexing:

      • Matteo will now continue with ITS GPU tracking.

      TRD Tracking:

      • Working on the strict matching mode (filter out ambiguous matches to obtain a very clean track sample); see the sketch after this list.
      • Will have a joint session with Ole and David on Friday to progress with the TRD GPU tracking.
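
      A strict matching filter of this kind typically keeps only matches that are unique in both directions. The sketch below illustrates the idea; the container layout and field names are made up, not the actual O2 TRD code.

      ```cpp
      // Illustrative sketch of strict matching: keep a seed-track-to-TRD-track
      // match only if both objects appear in exactly one candidate pair.
      #include <map>
      #include <vector>

      struct MatchCandidate {
        int trackId;    // ID of the seed (e.g. ITS-TPC) track
        int trdTrackId; // ID of the matched TRD track
        float chi2;     // match quality
      };

      std::vector<MatchCandidate> strictFilter(const std::vector<MatchCandidate>& candidates)
      {
        std::map<int, int> seedCount, trdCount;
        for (const auto& c : candidates) { // count how often each object is matched
          ++seedCount[c.trackId];
          ++trdCount[c.trdTrackId];
        }
        std::vector<MatchCandidate> clean;
        for (const auto& c : candidates) {
          // Keep only unambiguous pairs: both sides matched exactly once.
          if (seedCount[c.trackId] == 1 && trdCount[c.trdTrackId] == 1) {
            clean.push_back(c);
          }
        }
        return clean; // very clean sample at the cost of some efficiency
      }
      ```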