ALICE Weekly Meeting: Software for Hardware Accelerators / PDP Run Coordination / Full System Test

Europe/Zurich, Videoconference (ALICE GPU Meeting)
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 11:00 – 11:20
      PDP Run Coordination (20m)
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Event Display Commissioning

      • Purchase of the ED PC is in progress; we have an old NVIDIA PC as a fallback for the time being.

      Problems during operation

      Issues on EPN farm affecting PDP:

      • AMD GPUs are currently not working on CS8; this is being investigated. For the moment the EPNs must stay on CC8.

      Issues currently lacking manpower, waiting for a volunteer:

      • Tool to "own" the SHM segment and keep it allocated and registered for the GPU (a minimal sketch is given after this list). Tested that this works by running a dummy workflow in the same SHM segment in parallel. Still needed: implement a proper tool, add the GPU registration, and add the FMQ feature to reset the segment without recreating it. Proposed to Matthias as a project.
        • Remarks: it might be necessary to clear the Linux cache before the allocation. What do we do with the DD-owned unmanaged SHM region? Related issue: if we start multiple async chains in parallel, we must also guarantee good NUMA pinning.
        • Becomes an urgent topic now; the most complicated part will be the integration with the EPN control. To be discussed when Andreas is back from vacation.
      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
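
      As a hedged illustration of the proposed SHM "owner" tool, the sketch below uses plain POSIX shared memory and the HIP runtime: it creates a named segment, locks it in RAM, registers it with the GPU, and then just stays alive so the segment remains allocated. The segment name and size are made up, and the real tool would have to operate on the FairMQ-managed segment and integrate with the EPN control instead:

        // Hypothetical sketch only: not the actual tool and not the FairMQ segment handling.
        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>
        #include <cstdio>
        #include <hip/hip_runtime.h>

        int main()
        {
          const char* name = "/o2_gpu_shm_owner"; // illustrative segment name
          const size_t size = 1ul << 30;          // illustrative size: 1 GiB

          int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
          if (fd < 0 || ftruncate(fd, size) != 0) {
            perror("shm_open/ftruncate");
            return 1;
          }
          void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (ptr == MAP_FAILED) {
            perror("mmap");
            return 1;
          }
          mlock(ptr, size); // keep the segment resident in RAM

          // Pin the segment for the GPU so DMA transfers can use it directly.
          if (hipHostRegister(ptr, size, hipHostRegisterDefault) != hipSuccess) {
            std::fprintf(stderr, "hipHostRegister failed\n");
            return 1;
          }

          pause(); // "own" the segment until the process is terminated
          return 0;
        }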

      Workflow repository

      • Waiting for AliECS to implement new fields in the GUI.
      • Need a new ODC version before we can make the O2 version selectable (currently fixed to the latest).

      EPN DPL Metric monitoring:

      • Johannes has added the required parts to the Telegraf configuration.
      • Tested and it appears to be working.
      • CPU and memory metrics are still missing; after discussing with Giulio, we have to add another command-line option to enable them.
      • Currently too many metrics are being sent; the amount of metric data must be reduced before this can be used in production. Discussed in JIRA: https://alice.its.cern.ch/jira/browse/O2-2583
    • 11:20 – 11:40
      Full System Test (20m)
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Full system test status

      • Full system test with 2 NUMA domains not terminating correctly. DD currently forwards EOS only to one consumer: https://alice.its.cern.ch/jira/browse/O2-2375. This is also a problem for some workflows on the EPN. We should think of a general solution.
        • Discussed possible solutions. Needs a new FairMQ feature, which is already in development: https://github.com/FairRootGroup/FairMQ/issues/384

      AMD / GPU stability issues in FST:

      • The compiler fails with an internal error when optimization is disabled (-O0). Shall be fixed in ROCm 4.4; waiting to receive a preliminary patch to confirm it is fixed.
      • Reported a new internal compiler error that occurs when we enable log messages in the GPU management code. Shall be fixed in ROCm 4.4; waiting to receive a preliminary patch to confirm it is fixed.
      • Obtained a workaround for the application crash, now waiting for a set of RPMs that contains all fixes together.

      GPU Performance issues in FST

      • One infrequent performance issue remains: single iterations on AMD GPUs can take significantly longer; we have seen up to 26 seconds instead of 18 seconds. Under investigation, but there is no large global effect.
      • The performance of the 32 GB GPU with 128-orbit TFs is lower than for the 70-orbit TFs we tested in August. The results fluctuate a bit, but the average is between 1600 and 1650 GPUs (compared to 1475 GPUs for the 70-orbit TF). Matteo has implemented a first version of the benchmark; it is currently running on the EPN.
      • New 6% performance regression with ROCm 4.3.

      Important general open points:

      • Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Giulio will follow this up, but currently some other things still have higher priority.
      • https://alice.its.cern.ch/jira/browse/QC-569 : Processors should not be serialized when they access the same input data (reported by Jens for QC, but relevant for the FST). Implemented in FairMQ; Giulio is implementing it on the DPL side. Highest priority now.
        • The first implementation will use an 'early forward' approach. This might cause trouble when the component that would forward the data creates optional outputs, so for now we will disable it in that case and print a warning that it is not efficient.
      • Chain getting stuck when the SHM buffer runs full: on hold. Long discussion last week but no conclusion yet. All discussed solutions require knowing how many TFs are in flight globally on a processing node, which is not yet available in DPL. We have discussed a possible implementation; Matthias will create a JIRA ticket detailing it (a toy sketch of the throttling idea follows this list).
        • In short: every sink without an output will send a dummy output, making the dummy-dpl-sink an implicit final object in the processing graph, which knows when a TF has finished processing. This information will be fed back into the raw data proxy (how exactly is to be discussed), which will delay injecting new TFs into the chain while the number in flight is above a certain threshold. If this is realized via metrics, the difference must be built at the source to be synchronous.
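
      As a toy illustration of this throttling idea (not DPL code; the names, the threshold, and the timing are invented), the source stops injecting while the number of TFs in flight, i.e. injected minus completed as reported by the final dummy sink, exceeds a limit:

        #include <atomic>
        #include <chrono>
        #include <cstdio>
        #include <thread>

        std::atomic<int> tfInjected{0};
        std::atomic<int> tfCompleted{0};
        constexpr int kMaxInFlight = 4; // illustrative threshold

        // Stand-in for the raw data proxy: delays injection of new TFs while
        // too many are still being processed.
        void rawDataProxy(int nTF)
        {
          for (int tf = 0; tf < nTF; ++tf) {
            while (tfInjected.load() - tfCompleted.load() >= kMaxInFlight) {
              std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
            tfInjected.fetch_add(1);
            std::printf("inject TF %d (in flight: %d)\n", tf, tfInjected.load() - tfCompleted.load());
          }
        }

        // Stand-in for the implicit final dummy sink: it knows when a TF has
        // finished processing and reports this back via the completed counter.
        void dummySink(int nTF)
        {
          for (int tf = 0; tf < nTF; ++tf) {
            while (tfCompleted.load() >= tfInjected.load()) {
              std::this_thread::sleep_for(std::chrono::milliseconds(1)); // nothing in flight yet
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(5)); // fake processing time
            tfCompleted.fetch_add(1); // frees one in-flight slot at the source
          }
        }

        int main()
        {
          const int nTF = 20;
          std::thread sink(dummySink, nTF);
          rawDataProxy(nTF);
          sink.join();
          return 0;
        }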

      Minor open points:

      Detector status:

      • EMCAL errors (no crash, just messages, EMCAL is working on it).
      • V0 reconstruction is still missing.
    • 11:40 – 12:00
      Software for Hardware Accelerators (20m)
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Report on ANS encoding progress

      • Work on the renormalization for the ANS is ongoing; aiming to have a presentation with quantitative numbers (memory usage, compression / decompression speed, compression factor wrt. entropy) in the next 2 weeks. A generic illustration of the rANS renormalization steps is sketched below.
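
      As a generic reference for the two renormalization aspects involved (rescaling the symbol frequencies to a power-of-two total, and the byte-wise state renormalization during coding), the sketch below shows a minimal textbook-style rANS round trip. It is an illustration only, not the O2 ANS code; the constants and the naive frequency fixup are assumptions:

        #include <algorithm>
        #include <cassert>
        #include <cstdint>
        #include <cstdio>
        #include <string>
        #include <vector>

        constexpr uint32_t PROB_BITS = 12;               // precision of the frequency table
        constexpr uint32_t PROB_SCALE = 1u << PROB_BITS; // frequencies are rescaled to sum to this
        constexpr uint32_t RANS_L = 1u << 23;            // lower bound of the coder state

        int main()
        {
          const std::string msg = "abracadabra abracadabra";

          // Histogram and naive renormalization of the frequencies to sum to PROB_SCALE.
          std::vector<uint32_t> count(256, 0), freq(256, 0), cumul(257, 0);
          for (unsigned char c : msg) { count[c]++; }
          uint32_t assigned = 0;
          int maxSym = 0;
          for (int s = 0; s < 256; ++s) {
            if (count[s] == 0) continue;
            freq[s] = std::max<uint32_t>(1u, static_cast<uint32_t>(count[s] * PROB_SCALE / msg.size()));
            if (count[s] > count[maxSym]) maxSym = s;
            assigned += freq[s];
          }
          freq[maxSym] += PROB_SCALE - assigned; // crude fixup; real renormalizers distribute this more carefully
          for (int s = 0; s < 256; ++s) { cumul[s + 1] = cumul[s] + freq[s]; }

          // Encode (symbols in reverse order, since the rANS byte stream is a LIFO stack).
          std::vector<uint8_t> out;
          uint32_t x = RANS_L;
          for (auto it = msg.rbegin(); it != msg.rend(); ++it) {
            unsigned char s = *it;
            uint32_t xMax = ((RANS_L >> PROB_BITS) << 8) * freq[s];
            while (x >= xMax) { out.push_back(x & 0xff); x >>= 8; } // state renormalization
            x = ((x / freq[s]) << PROB_BITS) + (x % freq[s]) + cumul[s];
          }
          for (int i = 0; i < 4; ++i) { out.push_back(x & 0xff); x >>= 8; } // flush the final state
          std::reverse(out.begin(), out.end());

          // Decode.
          size_t pos = 0;
          x = 0;
          for (int i = 0; i < 4; ++i) { x = (x << 8) | out[pos++]; }
          std::string decoded;
          for (size_t i = 0; i < msg.size(); ++i) {
            uint32_t slot = x & (PROB_SCALE - 1);
            int s = 0;
            while (cumul[s + 1] <= slot) { ++s; } // linear symbol search, fine for a sketch
            decoded.push_back(static_cast<char>(s));
            x = freq[s] * (x >> PROB_BITS) + slot - cumul[s];
            while (x < RANS_L && pos < out.size()) { x = (x << 8) | out[pos++]; } // state renormalization
          }
          assert(decoded == msg);
          std::printf("%zu input bytes -> %zu compressed bytes\n", msg.size(), out.size());
          return 0;
        }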

      Report on GPU memory micro benchmark progress

      • Still preparing a presentation with the results, to be shown in this meeting. Currently working on an ITS issue that needs fixing for the Pilot beam, will continue on the microbenchmark afterwards.

      ITS GPU Tracking and Vertexing:

      • PR merged that fixes the GPU vertexing, will continue work on the tracking after the memory micro benchmarks are implemented.

      TRD Tracking

      • Working on the strict matching mode (filtering out ambiguous matches to obtain a very clean track sample).
      • David and Ole need to sit together to commission the TPC-TRD tracking on GPUs (after the vacation)
      • First version of the refit implemented, not checked yet by Ruben / David.

      GPU monitoring

      • Nothing happened yet, Johannes will follow this up with Alexander.