ALICE GPU Meeting

Europe/Zurich
Videoconference
Zoom Meeting ID
61230224927
Host
David Rohr
    • 11:00 - 11:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (news from this week: blue, news from last week: purple, no news: black)

      Full system test status

      • NUMA Imbalance Problem: fixed. FairMQ memory-registration callbacks could go to the DPL chain in the other NUMA domain first; fixed by mlocking the memory once in the allocating process before sending the callback.
      • EMCAL processing speed: fixed.
      • TRD: Found a problem with tracklets from raw data. Still waiting for a fix from Sean.
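
      The mlock fix relies on Linux first-touch NUMA placement: locking the pages in the allocating process faults them in there, so they are bound to the local NUMA node before any remote process receives the registration callback. A minimal sketch of the idea, using plain anonymous memory as a stand-in for the FairMQ SHM segment (size and setup are illustrative only):

      ```cpp
      #include <sys/mman.h>
      #include <cstdio>

      int main() {
          const size_t len = 16 * 1024; // small stand-in for the SHM region
          // In the real fix this memory is the FairMQ shared-memory segment.
          void* buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED) { perror("mmap"); return 1; }
          // mlock faults in every page in *this* process, so first-touch
          // NUMA policy places the pages on the local node before any
          // callback is sent to a process in the other NUMA domain.
          if (mlock(buf, len) != 0) { perror("mlock"); return 1; }
          printf("locked %zu bytes\n", len);
          munlock(buf, len);
          munmap(buf, len);
          return 0;
      }
      ```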

      Problems during operation

      • Improvements for the memory-registration callbacks merged; in the FST, callbacks now arrive immediately. Still need to implement waiting for all callbacks before going to the READY state.

      AMD / GPU stability issues:

      • Compiler fails with internal error when optimization is disabled (-O0): Shall be fixed in ROCm 4.4. Waiting to receive a preliminary patch to confirm it is fixed.
      • Reported a new internal compiler error when we enable log messages in the GPU management code. Shall be fixed in ROCm 4.4. Waiting to receive a preliminary patch to confirm it is fixed.
      • FST random startup failure during GPU memory registration: fixed.
      • ROCm >= 4.2 fails to start the FST with a memory error: fixed.
      • New problem with ROCm 4.3 on the test server: the FST segfaults randomly or processes exit with an exception. Looks like memory corruption in the FST. The server sometimes dies, with the kernel crash log hinting at the amdgpu kernel module. Under investigation.
        • Have a workaround that disables GPU memory registration and does PIO transfers instead of DMA transfers. The overall performance decrease is ~15%.

      Crashes in cosmics data reconstruction:

      • Merged one more fix for cosmics processing, and added option to manually set max number of clusters for noisy rows (since impossible to estimate from the ZS metadata). Jens wants to run another large-scale test.

      GPU Performance issues

      • One infrequent performance issue remains: single iterations on AMD GPUs can take significantly longer; we have seen up to 26 seconds instead of 18. Under investigation, but no large global effect.
      • Performance of the 32 GB GPU with a 128-orbit TF is lower than for the 70-orbit TF we tested in August. Results fluctuate a bit, but the average is between 1600 and 1650 GPUs (compared to 1475 GPUs for the 70-orbit TF). Matteo has implemented a first version of the benchmark; it is currently running on the EPN.
      • New 6% performance regression with ROCm 4.3.

      Issues on EPN farm affecting FST:

      • Network problem between containers --> connection aborts, failures to check out git, etc. Can be circumvented by using IPoIB instead of the Ethernet connection.
      • AMD GPUs currently do not work on CS8; investigating. For the moment the EPN must stay on CC8. Some work is in progress, as the problem seems minor, but it is not yet clear whether CS8 will be supported officially.


      Important general open points:

      • Avoid reparsing of TPC input buffers in TPC reco workflow. Matthias is working on it now.
      • Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Partially merged, disabled at compile time; will follow up when Giulio returns.
      • https://alice.its.cern.ch/jira/browse/QC-569 : Processors should not be serialized when they access the same input data (reported by Jens for QC, but relevant for the FST). On hold: needs new FairMQ features; extensive discussions in https://alice.its.cern.ch/jira/browse/ALFA-15.
        • This feature might have some interplay with the "optional" type of the raw data messages we use in order to handle missing detectors. Might need some additional work in DPL.
      • Chain getting stuck when the SHM buffer runs full: on hold. Long discussion last week but no conclusion yet. All discussed solutions require knowledge of how many TFs are in flight globally on a processing node, which is not yet available in DPL. --> Implement this first, then see further.
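
      Any of the discussed solutions would need a node-global count of TFs in flight that the input side can block on. A hypothetical sketch of such a throttle (the class name, limit, and call sites are illustrative, not an existing DPL API):

      ```cpp
      #include <condition_variable>
      #include <mutex>
      #include <thread>
      #include <vector>
      #include <cstdio>

      // Hypothetical node-global throttle: acquire() when a TF enters the
      // chain, release() when its SHM buffers are freed, so the reader
      // blocks instead of filling the SHM segment and deadlocking.
      class TfThrottle {
          std::mutex m_;
          std::condition_variable cv_;
          int inFlight_ = 0;
          const int max_;
      public:
          explicit TfThrottle(int max) : max_(max) {}
          void acquire() {
              std::unique_lock<std::mutex> l(m_);
              cv_.wait(l, [this] { return inFlight_ < max_; });
              ++inFlight_;
          }
          void release() {
              std::lock_guard<std::mutex> l(m_);
              --inFlight_;
              cv_.notify_one();
          }
      };

      int main() {
          TfThrottle throttle(4); // at most 4 TFs in flight per node
          int processed = 0;
          std::mutex pm;
          std::vector<std::thread> workers;
          for (int i = 0; i < 16; ++i) {
              throttle.acquire(); // blocks once 4 TFs are in flight
              workers.emplace_back([&] {
                  // ... process the TF, then free its SHM buffers ...
                  { std::lock_guard<std::mutex> l(pm); ++processed; }
                  throttle.release();
              });
          }
          for (auto& t : workers) t.join();
          printf("processed %d TFs\n", processed);
          return 0;
      }
      ```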


      Open minor DPL-related (or FairMQ) issues:


      Issues with detectors:

      • EMCAL errors (no crash, just messages, EMCAL is working on it).

      Status of remaining detectors:

      • MCH integrated in FST.
      • V0 reconstruction work in progress, to be ready in July.

      Issues currently lacking manpower, waiting for a volunteer:

      • Tool to "own" the SHM segment and keep it allocated and registered for the GPU. Tested that this works by running a dummy workflow in the same SHM segment in parallel. Need to implement a proper tool, add the GPU registration, and add the FMQ feature to reset the segment without recreating it. Proposed to Matthias as a project.
        • Remarks: it might be necessary to clear the Linux cache before allocation. What do we do with the DD-owned unmanaged SHM region?
        • Related issue: what happens if we start multiple async chains in parallel? --> Must also guarantee good NUMA pinning.
      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similar to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108.
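
      The segment-owner tool above essentially boils down to creating a named SHM segment and keeping it mapped (and GPU-registered) while workflows come and go. A minimal sketch with POSIX shared memory; the segment name and size are made up, and the GPU registration step (e.g. via hipHostRegister) is only indicated in a comment:

      ```cpp
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <cstdio>
      #include <cstring>

      // Hypothetical segment name; real FairMQ segments use their own naming.
      static const char* kName = "/fst_shm_owner_demo";

      int main() {
          const size_t len = 64 * 1024; // illustrative size
          int fd = shm_open(kName, O_CREAT | O_RDWR, 0600);
          if (fd < 0) { perror("shm_open"); return 1; }
          if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }
          void* seg = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
          if (seg == MAP_FAILED) { perror("mmap"); return 1; }
          // Touch the pages so they are resident; the real tool would also
          // register the range with the GPU driver here (hipHostRegister)
          // and keep both the mapping and the registration alive.
          memset(seg, 0, len);
          printf("owning segment %s (%zu bytes)\n", kName, len);
          // The real owner would now idle, resetting the segment on request
          // without recreating it; here we just clean up.
          munmap(seg, len);
          close(fd);
          shm_unlink(kName);
          return 0;
      }
      ```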