Alice Weekly Meeting: Software for Hardware Accelerators / PDP Run Coordination / Full System Test

Europe/Zurich, Videoconference (ALICE GPU Meeting)
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 11:00 → 11:20
      PDP Run Coordination 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Event Display Commissioning

      • The supplier has informed us that, although the ED PC was previously confirmed as in stock, they cannot ship it before January. Since we have a temporary setup now, we will only replace the PC for next year. In the meantime, Guy is also checking alternatives in case they still cannot deliver in January.

      Problems during operation

      • TPC GPU processing crashing regularly since we updated to ROCm 4.3.
        • I have a reproducer with data replay from recorded raw data (50 TB dataset). So far I could not identify a single TF in the dataset that causes the issue.
        • The same dataset also crashes with ROCm 4.1, however with a different ROCm error message. It is not clear whether it is the same issue --> downgrading to ROCm 4.1 makes no sense.
        • I have found a workaround, which costs a factor of 2-3 in performance but avoids the crash. This should be sufficient for the pilot beam.
        • The issue is data-driven; it is not reproducible with Pb-Pb MC or with other data we recorded before.
        • The investigation is currently stalled due to an EOS problem: data replay from EOS constantly gets stuck, and I cannot store the 50 TB locally. The EPN folks and Latchezar are investigating.
      • Seeing backpressure messages from the tpc-its-matcher in global runs, even though the processing speed is sufficient. Over time, the SHM segment runs full and all runs fail after ~10 minutes - it is not clear whether this is related.
        • Ruben has a reproducer that can run locally.
        • If the its-tpc-matcher is removed, we instead see some backpressure messages from the its-raw-decoder to the readout-proxy (not there before), but the SHM problem is gone.
        • To be investigated with Giulio.
      • RPMs of nightly builds are not available on the EPNs. We now have nightly builds of the O2PDPSuite for the EPNs, but the corresponding RPMs are not available. This should be fixed ASAP, such that we can update the software more easily / flexibly.

      Issues on EPN farm affecting PDP:

      • AMD GPUs are currently not working on CS8; we are investigating, and for the moment the EPNs must stay on CC8.

      Issues currently lacking manpower, waiting for a volunteer:

      • Tool to "own" the SHM segment and keep it allocated and registered for the GPU. Tested that this works by running a dummy workflow in the same SHM segment in parallel. Need to implement a proper tool, add the GPU registering, add the FMQ feature to reset the segment without recreating it. Proposed to Matthias as a project. Remarks: Might be necessary to clear linux cache before allocation. What do we do with DD-owned unmanaged SHM region? Related issue: what happens if we start multiple async chains in parallel --> Must also guarantee good NUMA pinning.
        • Contacted GSI group whether they can implement this, no reply yet.
      • For debugging, it would be convenient to have a proper tool that (using the FairMQ debug mode) can list all messages currently in the SHM segments, similar to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
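
      A minimal sketch of what such an "owner" tool could look like, assuming a plain POSIX shared-memory segment and HIP for the GPU registration; the segment name and size are made up for illustration, NUMA pinning is omitted, and the real tool would have to work on the FairMQ-managed segment and use the new reset feature:

        // Sketch: hold a shared-memory segment allocated and pinned for the GPU.
        // Assumes POSIX shm + HIP; the real tool would attach to the FairMQ-managed
        // segment instead of creating its own, and would also handle NUMA pinning.
        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>
        #include <cstdio>
        #include <cstring>
        #include <hip/hip_runtime.h>

        int main()
        {
          const char* name = "/o2_shm_owner_demo"; // hypothetical segment name
          const size_t size = 1ull << 30;          // 1 GiB, for illustration only

          int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
          if (fd < 0 || ftruncate(fd, size) != 0) {
            std::perror("shm_open/ftruncate");
            return 1;
          }
          void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (ptr == MAP_FAILED) {
            std::perror("mmap");
            return 1;
          }
          std::memset(ptr, 0, size); // touch all pages so they are really allocated

          // Register the memory with the GPU driver, so that it stays pinned for later users.
          if (hipHostRegister(ptr, size, hipHostRegisterDefault) != hipSuccess) {
            std::fprintf(stderr, "hipHostRegister failed\n");
            return 1;
          }

          std::printf("Segment %s allocated and GPU-registered, keeping it alive...\n", name);
          pause(); // keep ownership until the tool is terminated

          hipHostUnregister(ptr);
          munmap(ptr, size);
          shm_unlink(name);
          return 0;
        }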

      Workflow repository

      • Waiting for AliECS to implement the new fields in the GUI; a demo version of the GUI has already been implemented by Vasco.
      • The new repository is already in use, automatic merging of QC workflows is implemented, and the next step is the calibration workflows.
      • Need additional DPL feature to automatically connect reconstruction and calibration workflows: https://alice.its.cern.ch/jira/browse/O2-2611

      Changes for October 1st:

      • Login as the epn user will be disabled for detector experts; login should be via NICE credentials. EPN still needs to fix the log file mode issue, such that experts can read the logs: https://alice.its.cern.ch/jira/browse/O2-2555
      • Automatic loading of DD/QC/O2 will be disabled for workflows; only workflows that load the modules explicitly will keep working. https://alice.its.cern.ch/jira/browse/O2-2553
      • With the next O2 update, workflows will depend on ODC. Users should load O2PDPSuite to have all required dependencies.

      EPN DPL Metric monitoring:

      • The excessive metric rate has been fixed, and the monitoring is now in operation: https://alice.its.cern.ch/jira/browse/O2-2583

      Excessive error messages to InfoLogger:

      • Reduction of InfoLogger messages is ongoing.

      Missing error messages / information from ODC / DDS in InfoLogger:

      • ODC does not forward errors written to stderr to the InfoLogger, thus we do not see when a process segfaults / dies from an exception / runs OOM without checking the log files on the node. There is only a cryptic error that a process exited, without specifying which one. https://alice.its.cern.ch/jira/browse/O2-2604
      • PartitionID and RunNr are shown only for messages coming from DPL, not for those coming from ODC/DDC: https://alice.its.cern.ch/jira/browse/O2-2602.
      • The run is not stopped when processes die unexpectedly. It should be, at least optionally during commissioning: https://alice.its.cern.ch/jira/browse/O2-2554
      • ODC/FMQ is flooding our logs with bogus errors: https://github.com/FairRootGroup/ODC/issues/19. Fixed with the next O2 version, need to update.

      Memory monitoring:

      • We are missing proper monitoring of the free SHM memory on the EPNs. Created a JIRA here: https://alice.its.cern.ch/jira/browse/R3C-638 (see the sketch below).
      • When something fails in a run, e.g. the GPU getting stuck (problem reported above), this yields unclear secondary problems with processes dying because they run out of memory. This happens because too many time frames get in flight, hence we need to limit that number. https://alice.its.cern.ch/jira/browse/O2-2589
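
      As a stopgap until the proper monitoring exists, even something as simple as the following could report the free space of the tmpfs backing the SHM segments (a sketch only; it assumes the segments live under /dev/shm, and a real version would push the value to the monitoring backend instead of printing it):

        // Sketch: report free space of the tmpfs backing the SHM segments.
        #include <sys/statvfs.h>
        #include <cstdio>

        int main()
        {
          struct statvfs s;
          if (statvfs("/dev/shm", &s) != 0) {
            std::perror("statvfs");
            return 1;
          }
          const double totalGiB = double(s.f_blocks) * s.f_frsize / (1ull << 30);
          const double freeGiB = double(s.f_bavail) * s.f_frsize / (1ull << 30);
          std::printf("SHM: %.1f GiB free of %.1f GiB total\n", freeGiB, totalGiB);
          return 0;
        }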

      EOS Cleanup:

      • Currently there are 20 PB of data on the EOS disk buffer; the data currently being written is not in the file catalogue, mostly raw data. When we switch to EPN2EOS for the transfer, all data will go to the catalogue. Then we have to disable the old scripts and run a cleanup campaign. We should give the detectors a phase of ~2 months to mark which data is relevant, and then wipe all the rest.
    • 11:20 → 11:40
      Full System Test 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      Full system test status

      • Full system test with 2 NUMA domains does not terminate correctly. DD currently forwards the EOS (end of stream) message only to one consumer: https://alice.its.cern.ch/jira/browse/O2-2375. This is also a problem for some workflows on the EPN. We should think of a general solution.
        • Discussed possible solutions. Needs a new FairMQ feature; work in progress by Dennis, who expects to have a first version at the beginning of this week: https://github.com/FairRootGroup/FairMQ/issues/384 (jira added)
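
      Conceptually, the missing piece is that the end-of-stream message has to be fanned out to every consumer of a channel instead of being delivered to a single one. A toy illustration of that fan-out (not FairMQ code; the Consumer struct and broadcastEndOfStream are made up, and the actual behaviour will depend on the new FairMQ feature tracked above):

        // Toy illustration: broadcast an end-of-stream marker to all consumers,
        // instead of letting it go round-robin to a single one.
        #include <cstdio>
        #include <deque>
        #include <string>
        #include <vector>

        struct Consumer {
          std::string name;
          std::deque<std::string> inbox;
        };

        void broadcastEndOfStream(std::vector<Consumer>& consumers)
        {
          for (auto& c : consumers) {
            c.inbox.push_back("EOS"); // every consumer sees the end-of-stream marker
          }
        }

        int main()
        {
          std::vector<Consumer> consumers{{"numa0-chain"}, {"numa1-chain"}};
          broadcastEndOfStream(consumers);
          for (const auto& c : consumers) {
            std::printf("%s received: %s\n", c.name.c_str(), c.inbox.back().c_str());
          }
          return 0;
        }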

      AMD / GPU stability issues in FST:

      • The compiler fails with an internal error when optimization is disabled (-O0): shall be fixed in ROCm 4.4. Waiting to receive a preliminary patch to confirm it is fixed.
      • Reported a new internal compiler error that appears when we enable log messages in the GPU management code. Shall be fixed in ROCm 4.4. Waiting to receive a preliminary patch to confirm it is fixed.
      • Obtained a workaround for the application crash, now waiting for a set of RPMs that contains all fixes together.
      • Random crashes when processing cosmics data with both ROCm 4.1 and 4.3, data-driven, not seen before. Could be the same ~8 hour MTF crash we had before, which was never understood and disappeared at some point.

      GPU Performance issues in FST

      • One infrequent performance issue remains: single iterations on AMD GPUs can take significantly longer; we have seen up to 26 seconds instead of 18 seconds. Under investigation, but no large global effect.
      • Performance of the 32 GB GPU with 128-orbit TFs is lower than for the 70-orbit TFs we tested in August. The results fluctuate a bit, but the average is between 1600 and 1650 GPUs (compared to 1475 GPUs for the 70-orbit TF). Matteo has implemented a first version of the benchmark; it is currently running on the EPN.
      • New 6% performance regression with ROCm 4.3.

      Important general open points:

      • Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Giulio will follow this up, but currently some other things still have higher priority.
      • https://alice.its.cern.ch/jira/browse/QC-569 : Processors should not be serialized when they access the same input data (reported by Jens for QC, but relevant for FST). Early forward approach merged and checked by Jens; it speeds up the processing as expected.
      • Chain getting stuck when the SHM buffer runs full: on hold, long discussion last week but no conclusion yet. All discussed solutions require knowledge of how many TFs are in flight globally on a processing node, which is not yet available in DPL (a minimal sketch of such a counter follows below). We have discussed a possible implementation, described by Matthias here: https://alice.its.cern.ch/jira/browse/O2-2589. To be implemented either by Giulio or by Matthias.
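
      A minimal sketch of the kind of node-global bookkeeping the discussed solutions need, assuming the devices can share a small SHM region with an atomic counter (the segment name, the helper functions and the limit are illustrative only; the real mechanism is what is being worked out in O2-2589):

        // Sketch: node-global counter of TFs in flight, kept in a small shared-memory
        // region so that every device on the node can check it before accepting a new TF.
        #include <atomic>
        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>
        #include <cstdio>

        struct TfAccounting {
          std::atomic<int> tfInFlight; // the freshly created segment is zero-filled, i.e. the counter starts at 0
        };

        TfAccounting* attachAccounting()
        {
          int fd = shm_open("/o2_tf_accounting", O_CREAT | O_RDWR, 0666); // hypothetical name
          if (fd < 0 || ftruncate(fd, sizeof(TfAccounting)) != 0) {
            return nullptr;
          }
          void* p = mmap(nullptr, sizeof(TfAccounting), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          close(fd);
          return p == MAP_FAILED ? nullptr : static_cast<TfAccounting*>(p);
        }

        // Called by the input proxy before pulling a new TF.
        bool tryAcquireTf(TfAccounting* acc, int maxInFlight)
        {
          int current = acc->tfInFlight.load();
          while (current < maxInFlight) {
            if (acc->tfInFlight.compare_exchange_weak(current, current + 1)) {
              return true; // TF accepted, counter incremented
            }
          }
          return false; // limit reached, apply backpressure instead of accepting the TF
        }

        // Called when the last device on the node is done with a TF.
        void releaseTf(TfAccounting* acc)
        {
          acc->tfInFlight.fetch_sub(1);
        }

        int main()
        {
          TfAccounting* acc = attachAccounting();
          if (!acc) {
            std::perror("attachAccounting");
            return 1;
          }
          const int maxInFlight = 4; // illustrative limit
          if (tryAcquireTf(acc, maxInFlight)) {
            std::printf("TF accepted, %d in flight\n", acc->tfInFlight.load());
            releaseTf(acc);
          }
          return 0;
        }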

      Minor open points:

      Detector status:

      • EMCAL errors (no crash, just messages, EMCAL is working on it).
      • V0 reconstruction added
    • 11:40 → 12:00
      Software for Hardware Accelerators 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)
      Report on GPU memory micro benchmark progress
      • Matteo is working to improve the benchmark.
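
      For reference, the bare-bones core of such a micro benchmark could look like the following (a sketch only, not Matteo's actual code: it just times host-to-device copies of a single buffer size with HIP events; a real benchmark would scan sizes, directions, pinned vs. pageable memory and access patterns):

        // Sketch: minimal GPU memory bandwidth micro benchmark using HIP events.
        #include <hip/hip_runtime.h>
        #include <cstdio>
        #include <vector>

        int main()
        {
          const size_t size = 256ull << 20; // 256 MiB per copy, illustrative
          const int nRepeat = 20;

          std::vector<char> host(size, 1);
          void* device = nullptr;
          if (hipMalloc(&device, size) != hipSuccess) {
            std::fprintf(stderr, "hipMalloc failed\n");
            return 1;
          }

          hipEvent_t start, stop;
          hipEventCreate(&start);
          hipEventCreate(&stop);

          hipEventRecord(start);
          for (int i = 0; i < nRepeat; ++i) {
            hipMemcpy(device, host.data(), size, hipMemcpyHostToDevice);
          }
          hipEventRecord(stop);
          hipEventSynchronize(stop);

          float ms = 0.f;
          hipEventElapsedTime(&ms, start, stop);
          const double gib = double(size) * nRepeat / (1ull << 30);
          std::printf("H2D bandwidth: %.2f GiB/s\n", gib / (ms / 1000.0));

          hipEventDestroy(start);
          hipEventDestroy(stop);
          hipFree(device);
          return 0;
        }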

      ITS GPU Tracking and Vertexing:

      • Matteo will continue work on the tracking after the memory micro benchmarks are implemented.

      TRD Tracking

      • Working on the strict matching mode (filter out ambiguous matches to obtain a very clean track sample; see the sketch after this list).
      • David and Ole need to sit together to commission the TPC-TRD tracking on GPUs (after the vacation)
      • First version of the refit implemented, not yet checked by Ruben / David.
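
      For illustration, the core of the strict matching selection could be as simple as keeping only match candidates where both the TPC and the TRD track appear exactly once (a sketch with a made-up MatchCandidate struct, not the actual O2 implementation):

        // Sketch: "strict matching" selection, keeping only unambiguous TPC-TRD pairs.
        #include <cstdio>
        #include <map>
        #include <vector>

        struct MatchCandidate {
          int tpcTrackId;
          int trdTrackId;
        };

        std::vector<MatchCandidate> selectStrictMatches(const std::vector<MatchCandidate>& all)
        {
          std::map<int, int> tpcCount, trdCount;
          for (const auto& m : all) {
            ++tpcCount[m.tpcTrackId];
            ++trdCount[m.trdTrackId];
          }
          std::vector<MatchCandidate> strict;
          for (const auto& m : all) {
            // Keep the pair only if neither track has any other match candidate.
            if (tpcCount[m.tpcTrackId] == 1 && trdCount[m.trdTrackId] == 1) {
              strict.push_back(m);
            }
          }
          return strict;
        }

        int main()
        {
          std::vector<MatchCandidate> candidates{{0, 10}, {1, 11}, {1, 12}, {2, 12}};
          auto strict = selectStrictMatches(candidates);
          std::printf("%zu of %zu candidates are unambiguous\n", strict.size(), candidates.size());
          return 0;
        }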

      GPU monitoring

      • Nothing has happened yet; Johannes will follow this up with Alexander.

      ANS Encoding (Michael's presentation)

      • Michael is going to add measurements of the total dictionary size and the total data size after compression to his slides.
      • Measuring the total compression/decompression time requires the C++ implementation; the current tests are done mostly in Python.
      • One possible idea would be to use two dictionaries, with an additional one for the rare symbols instead of writing them out uncompressed. But since the rare symbols contribute only little to the overall data volume, one would probably not gain much in terms of compression. And incompressible symbols would still need to be handled for those that are not part of either of the two dictionaries. (A rough size estimate is sketched below.)
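
      To make that argument concrete, a quick estimate along the following lines shows how little the rare symbols typically contribute to the data volume (a sketch with a purely synthetic frequency distribution, not Michael's actual measurements):

        // Sketch: estimate how much of the data volume is carried by "rare" symbols
        // (those below a frequency cutoff) versus the frequent ones, to judge whether
        // a second dictionary for the rare symbols would pay off.
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        int main()
        {
          // Synthetic symbol frequencies: a few very common symbols and a long tail.
          std::vector<uint64_t> counts;
          for (int i = 0; i < 64; ++i) {
            counts.push_back(1000000 / (i + 1)); // frequent symbols
          }
          for (int i = 0; i < 4000; ++i) {
            counts.push_back(3); // long tail of rare symbols
          }

          const uint64_t cutoff = 100; // symbols below this would go to the second dictionary
          uint64_t total = 0, rare = 0;
          size_t nRare = 0;
          for (uint64_t c : counts) {
            total += c;
            if (c < cutoff) {
              rare += c;
              ++nRare;
            }
          }
          std::printf("Rare symbols: %zu of %zu symbols, but only %.2f%% of the data volume\n",
                      nRare, counts.size(), 100.0 * rare / total);
          return 0;
        }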