ALICE Weekly Meeting: Software for Hardware Accelerators / PDP Run Coordination / Full System Test

Europe/Zurich
Videoconference
ALICE GPU Meeting
Zoom Meeting ID
61230224927
Host
David Rohr
    • 11:00 AM 11:20 AM
      PDP Run Coordination 20m
      Speakers: David Rohr (CERN) , Ole Schmidt (CERN)

      Color code: (important, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Event Display Commissioning

      • Order for ED PC out.

      Problems during operation

      • TPC GPU processing crashing regularly since we updated to ROCm 4.3.
        • This killed every global run / most TPC standalone runs so far.
        • Not clear why it didn't occur in some TPC standalone runs, seems partially data driven.
        • Overnight I was able to reproduce it in a standalone data replay run with ROCm 4.3. I will now retest the same setup with ROCm 4.1 to see whether it is a regression in ROCm or something else.
          The test will now run for 24-48 hours. If it is stable with ROCm 4.1, a regression in 4.3 is very likely, and we will probably need to downgrade to 4.1, since AMD will probably need some time, O(weeks), to fix it.
        • How do we proceed? Downgrade ROCm again? That would require:
          • EPN needs to go back from CentOS 8.4 to 8.3, since ROCm 4.1 is not compatible with 8.4 (while 4.3 is not compatible with 8.3).
          • EPN would need to deploy the kernel module patch for 4.1 manually on all nodes. Otherwise we are back to the situation we had before, where nodes die randomly at start/stop, which was fixed by 4.3.
          • The old build container with 4.1 cannot build the current O2 software, but I can create a special build container that has ROCm 4.1 and is compatible with the current O2 build. That special container would then need to be used temporarily for the EPN builds.
        • Investigated the problem a bit with GDB during a global run:
          • There is an error message from the kernel module in dmesg.
          • Then processing on one GPU stops.
          • The application hangs indefinitely in a hipDeviceSynchronize call, the GPU doesn't respond any more to the application.
        • Possible alternative to run the TPC on the CPU:
          • The cluster finder is optimized for Pb-Pb and for the GPU. The CPU implementation is rather slow, and it gets extremely slow for sparse data. The code has also been unmaintained since Felix left the group in Frankfurt.
          • Did some simple benchmarking: the majority of the time is spent in cleaning / filling charge maps. This runs at ~10 GB/s, which is OK but not great.
          • At the current speed, we would need ~220 EPNs for TPC CPU processing of sparse data.
          • This cannot be avoided without changing the way the clusterizer works. (It is no problem on the GPU, since it has 1 TB/s throughput and can better overlap this with processing.)
          • I see 2 ways for improvements:
            • Use vector instructions, which might require compiling the code with a proper -march flag.
            • I can tune something in the OpenMP scheme to process multiple sectors in parallel. That could yield a speedup of probably ~2x to 4x.
        • Johannes will be on vacation starting tomorrow and will be back at the end of next week. For a downgrade of the EPN servers he would be needed. He might be able to look into it next week.
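
      The second improvement idea for the CPU fallback (processing multiple sectors in parallel with OpenMP) could be sketched roughly as below. This is only an illustration under assumptions: processSector and processAllSectors are hypothetical names, and the per-sector work is a placeholder that merely sums charges; it is not the O2 clusterizer. Compile with -fopenmp; without that flag the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kNSectors = 36; // ALICE TPC: 18 sectors on each side

// Hypothetical stand-in for the per-sector cluster finding:
// it only sums the charges of the digits in one sector.
long processSector(const std::vector<int>& digits)
{
  long sum = 0;
  for (int q : digits) {
    sum += q;
  }
  return sum;
}

// Process the sectors concurrently. Each iteration touches only its own
// input and its own output slot, so no locking is needed.
std::vector<long> processAllSectors(const std::vector<std::vector<int>>& sectors)
{
  assert(sectors.size() <= kNSectors);
  std::vector<long> result(sectors.size(), 0);
  // Dynamic scheduling, since the occupancy can differ between sectors.
#pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < static_cast<int>(sectors.size()); i++) {
    result[i] = processSector(sectors[i]);
  }
  return result;
}
```

      Whether a gain in the quoted ~2x to 4x range is reached would depend on how the charge map work is distributed over the threads, which this placeholder does not model.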

      Issues on EPN farm affecting PDP:

      • AMD GPUs are currently not working on CS8 (under investigation). For the moment the EPN must stay on CC8.

      Issues currently lacking manpower, waiting for a volunteer:

      • Tool to "own" the SHM segment and keep it allocated and registered for the GPU. Tested that this works by running a dummy workflow in the same SHM segment in parallel. Need to implement a proper tool, add the GPU registering, and add the FMQ feature to reset the segment without recreating it. Proposed to Matthias as a project. Remarks: It might be necessary to clear the Linux cache before allocation. What do we do with the DD-owned unmanaged SHM region? Related issue: what happens if we start multiple async chains in parallel --> Must also guarantee good NUMA pinning.
        • Discussed with Volker, should ideally be done by GSI group, since it mostly involves FMQ/DDS/ODC, which are all developed at GSI. Will contact Mohammad for that.
      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108

      Workflow repository

      • Waiting for AliECS to implement new fields in the GUI, demo version of GUI already implemented by Vasco.
      • New ODC deployed, O2 version will be made selectable 1.10.
      • New repository already in use, automatic merging of QC workflows implemented, next step is calibration workflows.

      EPN DPL Metric monitoring:

      • Metric data rate was too high. A fix is already deployed but not yet tested, due to the problems in global runs with TPC: https://alice.its.cern.ch/jira/browse/O2-2583

      Excessive error messages to InfoLogger:

      • Detectors are sending excessive error messages for corrupt raw data. We should reduce this in a way that we still see when there are errors, but we must not flood the log. This was observed yesterday from ITS and TOF.
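
      One possible way to keep errors visible without flooding the log is to cap the number of messages per error source in a time window and report how many were suppressed. The sketch below is an assumption for illustration only: ErrorThrottle and its methods are hypothetical and not part of InfoLogger or O2, and the time is passed in explicitly (assumed monotonic) so that the logic is easy to test.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper: allows at most maxPerWindow messages per key
// (e.g. per detector or error type) within each time window.
class ErrorThrottle
{
 public:
  ErrorThrottle(uint64_t maxPerWindow, uint64_t windowSeconds)
    : mMax(maxPerWindow), mWindow(windowSeconds)
  {
    assert(windowSeconds > 0);
  }

  // Returns true if the message should be sent to the log.
  bool shouldLog(const std::string& key, uint64_t nowSeconds)
  {
    State& s = mStates[key];
    if (nowSeconds - s.windowStart >= mWindow) {
      // A new window starts: reset the counter for this key.
      s.windowStart = nowSeconds;
      s.count = 0;
    }
    return ++s.count <= mMax;
  }

  // Number of messages suppressed for this key in its current window.
  uint64_t suppressed(const std::string& key) const
  {
    auto it = mStates.find(key);
    if (it == mStates.end() || it->second.count <= mMax) {
      return 0;
    }
    return it->second.count - mMax;
  }

 private:
  struct State {
    uint64_t windowStart = 0;
    uint64_t count = 0;
  };
  uint64_t mMax;
  uint64_t mWindow;
  std::map<std::string, State> mStates;
};
```

      A caller would print the message only when shouldLog returns true, and could emit one summary line with the suppressed count at the end of each window.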

      Missing error messages / information from ODC / DDS in InfoLogger:

      • ODC does not forward errors written to stderr to the InfoLogger, thus we do not see when a process segfaults / dies by exception / runs out of memory without checking the log files on the node. There is only the cryptic error that a process exited, without specifying which one. https://alice.its.cern.ch/jira/browse/O2-2604
      • PartitionID and RunNr are shown only for messages coming from DPL, not for those coming from ODC/DDS: https://alice.its.cern.ch/jira/browse/O2-2602.
      • Run is not stopped when processes die unexpectedly. This should be the case, at least optionally during commissioning: https://alice.its.cern.ch/jira/browse/O2-2554
      • ODC/FMQ is flooding our logs with bogus errors: https://github.com/FairRootGroup/ODC/issues/19. Fix available, to be deployed in next days.

      Memory monitoring:

      • We are missing a proper monitoring of the free memory in the SHM on the EPNs. Created a JIRA here: https://alice.its.cern.ch/jira/browse/R3C-638
      • When something fails in a run, e.g. a GPU getting stuck (problem reported above), this yields unclear secondary problems with processes dying because they run out of memory. This happens because too many time frames are in flight. Hence we need a limit on the number of time frames in flight. There is already a JIRA for this in the FST section, just repeating it here since it affects data taking.
    • 11:20 AM 11:40 AM
      Full System Test 20m
      Speakers: David Rohr (CERN) , Ole Schmidt (CERN)

      Color code: (news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Full system test status

      • Full system test with 2 NUMA domains not terminating correctly. DD currently forwards EOS only to one consumer: https://alice.its.cern.ch/jira/browse/O2-2375. This is also a problem for some workflows on the EPN. We should think of a general solution.
        • Discussed possible solutions. Needs a new FairMQ feature, which is already in development: https://github.com/FairRootGroup/FairMQ/issues/384 (jira added)

      AMD / GPU stability issues in FST:

      • Compiler fails with an internal error when optimization is disabled (-O0): Should be fixed in ROCm 4.4. Waiting to receive a preliminary patch to confirm it is fixed.
      • Reported a new internal compiler error when we enable log messages in GPU management code. Should be fixed in ROCm 4.4. Waiting to receive a preliminary patch to confirm it is fixed.
      • Obtained a workaround for the application crash, now waiting for a set of RPMs that contains all fixes together.
      • Seeing random crashes during data taking with that new version, might be a regression. Was probably not seen in FST / standalone tests since it is data driven (doesn't seem to happen with the PbPb we use for most tests) and since it is rare and takes >10 hours to reproduce on a single node. We should discuss how future updates should be handled, but basically it is not feasible for me to test every new AMD driver in any possible data taking scenario for many hours on multiple nodes.

      GPU Performance issues in FST

      • One infrequent performance issue remains: single iterations on AMD GPUs can take significantly longer, have seen up to 26 seconds instead of 18 seconds. Under investigation, but no large global effect.
      • Performance of the 32 GB GPU with 128 orbit TF is lower than with the 70 orbit TF we tested in August. Results fluctuate a bit, but the average is between 1600 and 1650 GPUs (compared to 1475 GPUs for 70 orbit TF). Matteo has implemented a first version of the benchmark, which is currently running on the EPN.
      • New 6% performance regression with ROCm 4.3.

      Important general open points:

      • Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Giulio will follow this up, but currently some other things still have higher priority.
      • https://alice.its.cern.ch/jira/browse/QC-569 : Processors should not be serialized when they access the same input data. (reported by Jens for QC, but relevant for FST). "Early forward approach implemented". Was stuck due to a bug in FMQ which is now fixed. CI now rerunning to validate. Afterwards I will check in FST.
      • Chain getting stuck when SHM buffer runs full: on hold. Long discussion last week but no conclusion yet. All discussed solutions require knowledge of how many TFs are in flight globally on a processing node, which is not yet available in DPL. We have discussed a possible implementation, Matthias will create a Jira ticket detailing it. In short: every sink without an output will send a dummy output, making the dummy-dpl-sink an implicit final object in the processing graph, which knows when a TF has finished processing. This information will be fed back into the raw data proxy (to be discussed how), which will delay injecting new TFs into the chain while the number in flight is above a certain threshold. In case this is realised via metrics, the difference must be built at the source to be synchronous.
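
      The in-flight limitation discussed for the SHM buffer issue can be illustrated with a minimal bookkeeping sketch. Everything here is an assumption for illustration: InFlightLimiter and its methods are hypothetical names and not DPL API, how the completion information reaches the raw data proxy is still to be discussed, and the sketch is single-threaded on purpose. It only shows the counter logic: the number in flight is the difference between injected and completed TFs, built at the source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping that would live at the source (raw data proxy).
class InFlightLimiter
{
 public:
  explicit InFlightLimiter(std::size_t maxInFlight) : mMax(maxInFlight)
  {
    assert(maxInFlight > 0);
  }

  // Called before injecting a new TF. Returns false if the threshold is
  // reached, in which case the caller should delay the injection.
  bool tryInject()
  {
    if (inFlight() >= mMax) {
      return false;
    }
    ++mInjected;
    return true;
  }

  // Called when the final sink reports that one TF finished processing.
  void onCompleted()
  {
    assert(mCompleted < mInjected);
    ++mCompleted;
  }

  // Number of TFs currently in flight: injected minus completed.
  std::size_t inFlight() const { return mInjected - mCompleted; }

 private:
  std::size_t mMax;
  std::size_t mInjected = 0;
  std::size_t mCompleted = 0;
};
```

      A real implementation would need to be thread-safe and to handle TFs that never complete, e.g. after a process dies.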

      Minor open points:

      Detector status:

      • EMCAL errors (no crash, just messages, EMCAL is working on it).
      • V0 reconstruction added
    • 11:40 AM 12:00 PM
      Software for Hardware Accelerators 20m
      Speakers: David Rohr (CERN) , Ole Schmidt (CERN)
      Report on GPU memory micro benchmark progress
      • Still preparing a presentation with the results, to be shown in this meeting. Currently working on an ITS issue that needs fixing for the Pilot beam, will continue on the microbenchmark afterwards.

      ITS GPU Tracking and Vertexing:

      • Matteo will continue work on the tracking after the memory micro benchmarks are implemented.

      TRD Tracking

      • Working on strict matching mode (filter out ambiguous matches to obtain very clean track sample)
      • David and Ole need to sit together to commission the TPC-TRD tracking on GPUs (after the vacation)
      • First version of the refit implemented, not checked yet by Ruben / David.

      GPU monitoring

      • Nothing happened yet, Johannes will follow this up with Alexander

      ANS Encoding (Michael's presentation)

      • Michael is going to add measurements for the total dictionary size and the total data size after compression to his slides.
      • Measuring the total compression/decompression time needs a C++ implementation. Current tests are done mostly with Python.
      • One possible idea would be to use two dictionaries, with an additional one for the rare symbols instead of writing them out uncompressed. But since the rare symbols contribute only little to the overall data volume, one would probably not gain much in terms of compression. Symbols that are in neither of the two dictionaries would still need to be handled as incompressible.
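
      To judge the two-dictionary idea, one could first measure how large the rare-symbol part actually is. The sketch below is a hypothetical helper, not Michael's implementation: splitByFrequency, DictionarySplit and the minCount threshold are assumed names and parameters. It splits a symbol histogram into frequent and rare symbols and reports which fraction of all occurrences the rare symbols make up. This fraction of occurrences is only a rough proxy for their share of the data volume.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct DictionarySplit {
  std::vector<uint32_t> frequent; // symbols for the main dictionary
  std::vector<uint32_t> rare;     // candidates for a second dictionary
  double rareFraction = 0.;       // share of all occurrences that are rare
};

// Splits the histogram (symbol -> number of occurrences) at minCount.
DictionarySplit splitByFrequency(const std::map<uint32_t, uint64_t>& histogram, uint64_t minCount)
{
  DictionarySplit split;
  uint64_t total = 0;
  uint64_t rareTotal = 0;
  for (const auto& entry : histogram) {
    total += entry.second;
    if (entry.second >= minCount) {
      split.frequent.push_back(entry.first);
    } else {
      split.rare.push_back(entry.first);
      rareTotal += entry.second;
    }
  }
  if (total > 0) {
    split.rareFraction = static_cast<double>(rareTotal) / static_cast<double>(total);
  }
  return split;
}
```

      If rareFraction turns out to be small for realistic data, that would support the expectation that a second dictionary gains little.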
    • 12:00 PM 12:30 PM
      ANS Encoding Report 30m
      Speaker: Michael Lettrich (Technische Universitaet Muenchen (DE))