Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Videoconference
ALICE GPU Meeting
Zoom Meeting ID
61230224927
Host
David Rohr
    • 11:00 AM 11:20 AM
      Discussion 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      High priority framework topics:

      • Problem with EndOfStream / Dropping Lifetime::Timeframe data
        • Latest update on oldestPossibleTimeframe / dropping lifetime::timeframe:
          • Most errors online are gone, InfoLogger much cleaner than before.
          • This severely affects async reco on the EPNs, majority of jobs go to error.
            • Both the current async tag, and David's test on the EPN were done with O2 versions that have known problems, which are meanwhile fixed. Need to recompile with O2/dev on the EPNs and retry.
            • Didn't see the problems anymore with O2/dev. Hopefully it is fixed, but should wait for larger scale tests on the GRID.
          • Still issues in online runs, asked Giulio to implement 3 checks (opened 3 JIRA tickets).
            • 1 improved error messages implemented and merged
            • 1 in PR, reveals problems in user code. To be followed up by David and QC experts before PR can be merged.
            • 1 still in development.
      • Fix START-STOP-START for good
        • https://github.com/AliceO2Group/AliceO2/pull/9895 , rebased and merged
        • With the above PR, and one more fix during the week, START/STOP/START seems to reset the DPL counters correctly.
        • Had some successful tests with TOF, but fails if we add all detectors:
          • MCH / ITS / EMC crash in QC postprocessing, Barth / Piotr are following this up.
          • TPC Track QC is crashing, but unrelated - can always happen when the run is short. Robert is checking.
      • Problem with QC topologies with expendable tasks. For the items still to do, see: https://alice.its.cern.ch/jira/browse/QC-953 - Status?
      • Problem in QC where we collect messages in memory while the run is stopped: https://alice.its.cern.ch/jira/browse/O2-3691
        • Latest fixes tested and fully working. All memory is freed when we go to ready.
        • Can be closed from DPL side.
        • Still a problem when there is a spike of messages, since FMQ does not support a global limit of the network buffers by design.
        • To me this is a design flaw. Need to discuss with them if such a limit can be added, or how we can work around it: https://alice.its.cern.ch/jira/browse/O2-4414
      • Switch 0xdeadbeef handling from on-the-fly creating dummy messages for optional messages, to injecting them at readout-proxy level.
        • Done
        • Was needed more urgently for CTP QC, and RC said waiting for detectors to change their workflows might take long. So we changed all workflows centrally.
        • Updated documentation, and asked detectors to double-check.
      • New issue: sometimes the CCDB populator produces backpressure without processing data. This has already crashed several Pb-Pb runs: https://alice.its.cern.ch/jira/browse/O2-4244
        • Disappeared after disabling the CPV gain calibration, which was very slow. However, this has probably only hidden the problem. Apparently there is a race condition that can trigger a problem in the input handling, which makes the CCDB populator get stuck. Since the run function of the CCDB populator is not called, and it has no special completion policy but simply consumeWhenAny, this is likely to be a generic problem.
        • Cannot be debugged during Pb-Pb right now, since it is mitigated. But it must be understood afterwards.

      Other framework tickets:

      Global calibration topics:

      • TPC IDC workflow problem.
      • TPC has issues with SAC workflow. Need to understand if this is the known long-standing DPL issue with "Dropping lifetime::timeframe" or something else.
      • Even with latest changes, difficult to ensure guaranteed calibration finalization at end of global run (as discussed with Ruben yesterday).
      • Problem with endOfStream in the middle of a run, stopping calib processing: fixed.

      CCDB:

      • Bump to libwebsockets 4.x / JAliEn-ROOT 0.7.4: Done

      Async reconstruction

      • Remaining oscillation problem: GPUs sometimes get stalled for a long time, up to 2 minutes.
        • Checking 2 things: does the situation get better without GPU monitoring? --> Inconclusive
        • Increased GPU process priority can be used as a mitigation, but it does not fully fix the issue.
      • Chiara reported again lower performance on both MI50 and MI100 EPNs in async reco, needs to be investigated.
        • Discussed with Chiara how I can reproduce it, but didn't check yet.
      • Async reco performance:
        • Work in progress - already a significant speed up but not enough.
        • TOF matching now supports multi-threading, which should remove it from the critical path of the latency.

      EPN major topics:

      • Fast movement of nodes between async / online without EPN expert intervention.
        • 2 goals I would like to set for the final solution:
          • It should not be needed to stop the SLURM schedulers when moving nodes, there should be no limitation for ongoing runs at P2 and ongoing async jobs.
          • We must not lose which nodes are marked as bad while moving.
      • Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
        • Lubos to provide an interface to query the current EPN SHM settings - ETA July 2023, Status?
      • Improve DataDistribution file replay performance, currently cannot do faster than 0.8 Hz, cannot test MI100 EPN in Pb-Pb at nominal rate, and cannot test pp workflow for 100 EPNs in FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
      • DataDistribution distributes data round-robin in the absence of backpressure, but it would be better to do it based on buffer utilization, and give more data to the MI100 nodes. Currently, we are driving the MI50 nodes at 100% capacity with backpressure, and only the backpressured TFs go to the MI100 nodes. This increases the memory pressure on the MI50 nodes, which is a critical point anyway. https://alice.its.cern.ch/jira/browse/EPN-397
      • TfBuilders should stop in ERROR when they lose connection.
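The buffer-utilization idea from EPN-397 could look roughly like the sketch below. This is only an illustration, not DataDistribution code: the `Node` class, the node names and the buffer capacities are hypothetical, and buffers are never drained in this toy example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    buffer_capacity: int  # hypothetical: number of TFs the node can buffer
    buffered: int = 0     # TFs currently waiting on the node

    @property
    def utilization(self) -> float:
        return self.buffered / self.buffer_capacity


def pick_node(nodes):
    """Return the node with the lowest relative buffer utilization."""
    return min(nodes, key=lambda n: n.utilization)


def distribute(nodes, n_tfs):
    """Assign n_tfs time frames one by one to the least utilized node."""
    for _ in range(n_tfs):
        pick_node(nodes).buffered += 1
    return {n.name: n.buffered for n in nodes}


# Hypothetical setup: the MI100 node has twice the buffer of an MI50 node.
nodes = [Node("mi50-a", 10), Node("mi50-b", 10), Node("mi100-a", 20)]
assignment = distribute(nodes, 20)
```

In this toy setup the node with the larger buffer receives twice as many TFs as each of the smaller ones, whereas plain round-robin would give every node the same share.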

      Other EPN topics:

      Raw decoding checks:

      • Add additional check on DPL level, to make sure firstOrbit received from all detectors is identical, when creating the TimeFrame first orbit.
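A minimal sketch of such a consistency check, written in Python for brevity. The function name and the input format (a mapping from detector name to reported firstOrbit) are assumptions for illustration and not the DPL interface.

```python
def check_first_orbit(first_orbits):
    """Return the common firstOrbit, or raise if the detectors disagree.

    first_orbits: dict mapping a detector name to the firstOrbit it reported
    (hypothetical input format).
    """
    if not first_orbits:
        raise ValueError("no detector reported a firstOrbit")
    distinct = set(first_orbits.values())
    if len(distinct) != 1:
        raise ValueError(
            f"inconsistent firstOrbit across detectors: {first_orbits}"
        )
    return distinct.pop()
```

The TimeFrame first orbit would then only be created from the returned value, so a mismatch is reported before it can propagate.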

      Full system test issues:

      Topology generation:

      • Should test to deploy topology with DPL driver, to have the remote GUI available. Status?

      QC / Monitoring / InfoLogger updates:

      • TPC has opened first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and plan is to extend this to all detectors, and to include also trending for raw data sizes.

      AliECS related topics:

      • Extra env var field still not multi-line by default.

      FairMQ issues:

      • After switch to FMQ 1.8.2, and redeployment of SHM management tool, most issues are solved, but some runs, e.g. TPC Laser, regularly fail.
      • Problem is too small FMQ refCount segment.
      • They do not want to increase the default size --> we must add a parameter to the creation code in the EPN SHM tool and in DataDistribution.
        • David can take care of the SHM tool, created a JIRA assigned to EPN for the DataDistribution part.

      High priority RC YETS issues:

      • All crashes of code under PDP responsibility fixed. Remaining crashes: 1 in QC, 2 in DataDistribution
      • Make Start / Stop / Start work: All known framework issues fixed. Remaining problems: 1 in Readout, 2 in QC
      • Fix dropping lifetime::timeframe for good: Work in progress, 2 features available (1 requires fixing the errors in user code it revealed), 1 feature in development.
      • Fix CTP QC / allow FIT to send non-raw data from FLPs / update to new 0xDEADBEEF mechanism: Done
      • Fix problem with input-proxy not dropping data: fixed, but now we need to make sure FMQ does not need too much memory if there are data spikes.
      • Expendable tasks in QC: Waiting for Giulio and Barth to investigate the current problem. In principle all is ready from both sides, but requires fixes.
      • Stabilize calibration / fix EoS: We have a plan how to implement it. Will take some time, but hopefully before restart of data taking.
      • Fix problem with ccdb-populator: no idea yet, no ETA.
      • Added a new issue: if one EPN is slow during node allocation, that kills the whole run, even if nmin is fulfilled. Happened 3 times during the tests on Tuesday. Opened https://alice.its.cern.ch/jira/browse/EPN-432
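The behaviour requested in EPN-432 can be illustrated with the sketch below. It assumes that nmin is the minimum number of nodes required to start a run. The function, the node names and the timeout are hypothetical, and this is not the actual allocation code.

```python
def select_nodes(ready_after_s, nmin, timeout_s):
    """Return the nodes usable for the run, or None if fewer than nmin are ready.

    ready_after_s maps a node name to the time in seconds it needed to become
    ready, or None if it never answered (hypothetical input format).
    """
    ready = sorted(
        node
        for node, t in ready_after_s.items()
        if t is not None and t <= timeout_s
    )
    # A slow node is simply left out; it only matters if nmin is not reached.
    return ready if len(ready) >= nmin else None


allocation = select_nodes(
    {"epn001": 2.0, "epn002": 3.5, "epn003": 120.0}, nmin=2, timeout_s=30.0
)
```

Here the slow node is dropped from the allocation instead of failing the run, since two nodes are ready in time and nmin is 2.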

      GPU ROCm / compiler topics:

      • Found a new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
      • Found a new miscompilation with -ffast-math enabled in looper following; -ffast-math is disabled for now.
      • Must create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
      • Found another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.
      • Debugging the calibration, debug output triggered another internal compiler error in HIP compiler. No problem for now since it happened only with temporary debug code. But should still report it to AMD to fix it.
      • New compiler regression in ROCm 5.6, need to create testcase and send to AMD.
      • AMD provided custom ROCm 5.7.1:
        • They did not manage to fix the compiler issue yet, so the new ROCm has an environment variable option that restores the old register-spilling behavior, which is not broken for us. The final fix is still in development.
        • Unfortunately, there is a new regression in ROCm 5.7.1, and we cannot use it. Created a reproducer and sent a bug report to AMD.
        • Didn't check yet if any of the other pending ROCm issues are fixed with 5.7.1

      TPC GPU Processing

      • Bug in TPC QC with MC embedding: TPC QC does not respect the sourceID of MC labels, so it confuses tracks of signal and background events.
      • Online runs at low IR / low energy observe weird number of clusters per track statistics.
        • Problem was due to incorrect vdrift, though it is not clear why this breaks tracking so badly, being investigated.
        • vDrift was off so much, that many tracks were cut away by eta cut, and were longer than 250 cm in z and thus track following was aborted.
      • Ruben reported an issue with the global track refit, which sometimes does not produce the TPC track fit results.
        • All issues are either fixed, or understood and originating from differences in the original track parameters when the fit starts.
      • New problem with bogus values in TPC fast transformation map still pending. Sergey is investigating, but waiting for input from Alex.
      • TPC has provided the dead channel map as a calib object. The next step is to respect it during tracking, and not to abort tracking if no hits are found where the channels are dead.

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
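Separating training and test data sets in such a scan could be sketched as follows. The helper names, the split fraction and the `measure` callback (which would run the GPU workflow on one time frame and return the processing time) are assumptions for illustration only.

```python
import random


def split_time_frames(time_frames, test_fraction=0.3, seed=0):
    """Shuffle reproducibly and split into disjoint (train, test) lists."""
    if len(time_frames) < 2:
        raise ValueError("need at least 2 time frames to split")
    shuffled = list(time_frames)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def scan(param_values, train, test, measure):
    """Pick the value with the lowest mean time on train, then report
    its mean time on the independent test set."""

    def mean_time(value, tfs):
        return sum(measure(value, tf) for tf in tfs) / len(tfs)

    best = min(param_values, key=lambda v: mean_time(v, train))
    return best, mean_time(best, test)
```

The time reported for the chosen value comes only from time frames that were not used for the tuning, so it is not biased by the selection itself.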
    • 11:20 AM 11:25 AM
      TRD Tracking 5m
      Speaker: Ole Schmidt (CERN)
    • 11:25 AM 11:30 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

       

      Pad-parallel tagging

      • Tag / exclude clusters which cross a pad row tangentially
      • Apply the looper tagger twice with different settings
      • Overall change in efficiency / clone rate / fake rate / total number of digit maxima

       

      Metric                  raw      looper tagging   parallel pad tagging
      Efficiency (%)          93.34    93.31            94.04
      Clone (%)                7.56     6.75             5.82
      Fake (%)                19.24    17.31            16.79
      Digit maxima (mio.)     25.8     22.6             16.2
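The overall change can be quantified from the numbers above. The sketch below assumes that the three values per quantity are ordered as raw, looper tagging, parallel pad tagging, and compares the first with the last.

```python
# Values copied from the table above: (raw, looper tagging, parallel pad tagging)
table = {
    "efficiency_pct": (93.34, 93.31, 94.04),
    "clone_pct": (7.56, 6.75, 5.82),
    "fake_pct": (19.24, 17.31, 16.79),
    "digit_maxima_mio": (25.8, 22.6, 16.2),
}


def relative_change(before, after):
    """Relative change in percent, negative for a reduction."""
    return (after - before) / before * 100.0


# Relative change from raw to parallel pad tagging for every quantity.
changes = {
    name: relative_change(values[0], values[-1])
    for name, values in table.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.1f} %")
```

Under that assumption, the number of digit maxima drops by roughly 37 % and the clone rate by roughly 23 % relative to raw, while the efficiency changes by less than 1 %.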

       

      Differential studies (for one sector)

      • Started with pT and η -> Can only analyse clusters which have an assignment (obviously)
      • Network mostly tags assigned clusters with very high probabilities (this is with looper tagging and pad-parallel tagging)
      • Black line is a typical cut-off line for the network (0.16 in this case)
      • At higher pT, the density seems to increase in favor of the network -> high-pT clusters are tagged correctly

      Neural network GPU speed

      • Bug in PyTorch: Conv3D layer (used for 3D network) does not properly utilize the MM GPU kernels
      • However FC layers do!
      • Conv3D ~0.5-1 TFLOPs ; FC layers: 10 - 20 TFLOPs
      • With only FC layers: Processing ~70-80 mio. clusters / s for classification network (relatively huge: 110k trainable parameters) on MI100
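As a rough cross-check of these numbers, the cluster rate can be converted into a sustained compute rate. The sketch assumes that every trainable parameter is a weight used in exactly one multiply-add (2 FLOPs) per cluster, and neglects biases and activation functions.

```python
def fc_tflops(n_params, clusters_per_second):
    """Sustained TFLOP/s of a fully connected network.

    Assumption: one multiply-add (2 FLOPs) per trainable parameter
    and per cluster; biases and activations are neglected.
    """
    return 2 * n_params * clusters_per_second / 1e12


low = fc_tflops(110_000, 70e6)   # 70 mio. clusters / s
high = fc_tflops(110_000, 80e6)  # 80 mio. clusters / s
print(f"{low:.1f} - {high:.1f} TFLOP/s")
```

Under this assumption, 70-80 mio. clusters / s correspond to about 15-18 TFLOP/s, which falls inside the 10 - 20 TFLOPs range quoted for the FC layers.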
    • 11:30 AM 11:35 AM
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)