Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Videoconference
ALICE GPU Meeting
Zoom Meeting ID
61230224927
Host
David Rohr
    • 11:00 → 11:20
      Discussion 20m
      Speakers: David Rohr (CERN), Ole Schmidt (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

       

      High priority framework topics:

      Other framework tickets:

      • After the 64k vertex problem was fixed, there is now a problem in the DebugGUI: it gets stuck at 100% CPU for large workflows. https://alice.its.cern.ch/jira/browse/O2-3535
      • Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead, to get the pure payload rate; low priority.
      • Backpressure reporting when there is only 1 input channel: no progress.
      • Stop the entire workflow if one process segfaults / exits unexpectedly. Tested again in January; still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
      • https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
      • https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
      • https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
      • https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
      • https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
      • https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts; Matthias will add a check to prevent such failures in the future.
      • Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it happens only at termination, and fixing the topology avoids it in any case, but we should still understand and fix the crash itself. A reproducer is available.
      • Support in DPL GUI to send individual START and STOP commands.
      • The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real. It will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.

      Global calibration topics:

      • TPC calib problem: need to repeat the test with the latest fixes, but as we still have problems in all online runs, I think it will require more work.

      Async reconstruction

      • Async reco with 1-NUMA-domain setup:
        • Test with the latest updates to be started by Costin today.
        • Current TF throughput of the processing only (neglecting all overhead):
          • ~13.5 s per TF in the 1-GPU workflow running on a full NUMA domain
          • ~6.5 s in the latest test of the 1-NUMA workflow (250 CTF files)
          • ~5.25 s in my latest local test
        • The memory leak in async reco was not a real leak, but metrics queuing up because metric processing was too slow. Meanwhile sped up by Giulio, and also disabled by Chiara.
      • Can now run with up to 500 CTF files; 1000 CTF files failed since the output exceeded the maximum allowed size. The limit is to be raised and the test retried.
      • Problem with CCDB pulling in libUV, breaking ROOT since the headers were not available; broke async reco with the latest O2 tag. Fixed by Giulio: https://github.com/AliceO2Group/AliceO2/pull/10954.
      • Checked the problem with severe serial overhead before / after processing: the actual 1-NUMA-domain processing was only 50% of the task's execution time:
        • Slow metric processing fixed. (Giulio)
        • Metric postprocessing by slow jq commands removed. (Chiara)
        • AOD merging parallelized (Chiara)
          • Should scale better with more CTF files, since then there will be more AODs to be merged in parallel. Tried locally with 250 CTF files --> 5 AODs; the expected speedup of 5 was seen.
      • Investigation of remaining oscillations:
        • GPU reconstruction is sometimes slow. Not fully clear what happens, but GPUs sometimes seem stuck / slow for a while when the server is under high load. E.g. DMA transfers drop from 30 GB/s to 30 MB/s, and a single transfer can then last 20 s.
          • TPC tracking was seen to last up to 250 s in such cases, compared to the normal 3-4 s. GPUs return to the normal state with the next TF.
          • Not data-driven / not reproducible when rerunning the same TF.
          • Should be investigated and perhaps reported to AMD.
        • Other processes also sometimes need excessive time:
          • ITS tracking.
          • Secondary vertexing (up to 200 s); seems to be a side effect of excessive ITS seeds.
          • EMCAL QC.
        • The problem is that such slow processes cause TFs to queue up at their input, so fewer TFs are in flight, leading to oscillations in the processing rate (a toy illustration follows below).
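
      As a toy illustration of this queueing effect (a made-up two-stage chain with invented numbers, not our actual topology or the DPL scheduler): once the slow stage stalls, the in-flight cap is consumed by TFs parked in its input queue, the upstream stage runs dry, and the completion rate dips and then recovers.

        // Toy model of rate oscillation caused by one occasionally-slow stage.
        // All numbers are invented for illustration; this is not the real DPL.
        #include <cstdio>
        #include <queue>

        int main()
        {
          const int maxInFlight = 8;       // in-flight TF cap
          const double tA = 2.0, tB = 4.0; // nominal per-TF stage times [s]
          double aBusyUntil = 0, bBusyUntil = 0;
          int inFlight = 0, donePerMinute = 0;
          bool bProcessing = false, stalled = false;
          std::queue<int> bQueue;          // TFs waiting at stage B's input
          for (double now = 0; now < 600; now += 1.0) {
            if (bProcessing && now >= bBusyUntil) { // B finished a TF
              bProcessing = false;
              --inFlight;
              ++donePerMinute;
            }
            if (now >= aBusyUntil && inFlight < maxInFlight) { // A injects a TF
              ++inFlight;
              aBusyUntil = now + tA;
              bQueue.push(1);
            }
            if (!bProcessing && !bQueue.empty()) { // B starts the next TF
              bQueue.pop();
              bProcessing = true;
              double t = tB;
              if (!stalled && now >= 120) { t = 200; stalled = true; } // one hiccup
              bBusyUntil = now + t;
            }
            if ((long)now % 60 == 59) { // report completed TFs per minute
              std::printf("minute %ld: %2d TFs done\n", (long)now / 60, donePerMinute);
              donePerMinute = 0;
            }
          }
          return 0;
        }

      The output shows the steady ~15 TFs/min rate collapsing to zero for several minutes while the stalled stage holds the whole in-flight budget, mirroring the observed rate oscillations.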

      EPN major topics:

      • EPN OS discussion
        • EPNs bumped to Alma 8.7 / ROCm 5.4.2; still no reply from AMD regarding official support. All Jenkins builders / CI containers / sync / async software versions bumped to the new setup.
      • Fast movement of nodes between async / online without EPN expert intervention.
      • Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
      • Check the total bytes written to the SSDs to get an impression of how much of the SSDs' lifetime we have already used; switch to using a ramdisk if necessary: https://alice.its.cern.ch/jira/browse/EPN-198 https://alice.its.cern.ch/jira/browse/EPN-237
        • Max 11% of the lifetime used; deployment of the RAMDISK cache can be delayed to later this year.
      • Improve DataDistribution file replay performance: currently we cannot replay faster than 0.8 Hz, so we cannot test the MI100 EPN in Pb-Pb at nominal rate, and cannot test the pp workflow for 100 EPNs in the FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
      • Need a DDS/ODC feature to deploy different topologies on EPNs with MI50 and with MI100. ETA: end of March + ~2 weeks.
      • Go to error state if a critical task (e.g. a calib task) fails (taking nmin into account). Currently we do not see failing calibration tasks at all, except for a message in the InfoLogger. ODC should go to error, and ECS should then stop the run automatically, also when n < nmin. ETA: end of March.

      Other EPN topics:

      EPN farm upgrade:

      • Identified the crashes on MI100 to be 2 independent problems:
        • Crash in the TPC track model encoding.
        • Corrupt data output to the host, which makes ITS-TPC matching crash due to out-of-bounds cluster indices.
      • In both cases the problem seems to be incorrect synchronization between kernel calls and DMA transfers. The synchronization is requested correctly by AliceO2 but not respected by the GPU / the kernel driver / the ROCm runtime. AMD needs to investigate. A schematic of the pattern in question is sketched below.
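
      For reference, the ordering at issue is schematically the following (a simplified HIP sketch with made-up kernel and buffer names, not the actual AliceO2 GPU code):

        // Schematic of the kernel / DMA ordering in question (simplified, not
        // the real AliceO2 code): the device-to-host copy must not start before
        // the kernel has finished writing its output. Same-stream ordering plus
        // hipStreamSynchronize() guarantees this by the API contract; the MI100
        // failures look as if the copy nevertheless observes incomplete data.
        #include <hip/hip_runtime.h>
        #include <vector>

        __global__ void produceOutput(int* out, int n) // stand-in for a reco kernel
        {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = i;
        }

        int main()
        {
          const int n = 1 << 20;
          int* dOut = nullptr;
          std::vector<int> hOut(n);
          hipStream_t stream;
          hipStreamCreate(&stream);
          hipMalloc((void**)&dOut, n * sizeof(int));

          hipLaunchKernelGGL(produceOutput, dim3((n + 255) / 256), dim3(256), 0, stream, dOut, n);
          hipStreamSynchronize(stream); // the kernel must be complete here
          // DMA transfer of the kernel output; must see the finished data.
          hipMemcpyAsync(hOut.data(), dOut, n * sizeof(int), hipMemcpyDeviceToHost, stream);
          hipStreamSynchronize(stream); // the host may read hOut only after this

          hipFree(dOut);
          hipStreamDestroy(stream);
          return 0;
        }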

      Full system test issues:

      • Full system test is back to stable; had to revert one more MCH PR that broke it (for unclear reasons, Laurent is investigating).
      • FST with pp data crashes on the EPNs in the digitization phase due to an alignment problem of a DCS CCDB object. Could be a ROOT bug. Ruben is investigating.

      Topology generation:

      • Ole is investigating using set -u or set -e to catch more errors, but they have some drawbacks. The current plan is to use -u but not -e. To be merged when Ole is back from vacation in 2 weeks.

Software deployment at P2:

      • O2PDPSuite updated on Monday; some failures necessitated another update.
        • A check for duplicate outputs in the topology was added, but MCH reco had duplicate outputs.
        • Severe performance regression in the CTF code due to an accidentally committed -O0 compile flag.
        • FST crashes with O2/dev on .tf files but works on .raw files, since runNumber 0 of raw .tf MC files was not supported. Fixed.
           

      Switch to 32 orbit TF:

      • TOF switched their code to read the TF length from the GRP (to be checked); with this, all detectors have adapted.
      • Provided new 32-orbit SYNTHETIC datasets for P2, currently Pb-Pb only since the pp simulation fails.

      QC / Monitoring / InfoLogger updates:

      • TPC has opened the first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side; the plan is to extend this to all detectors and to also include trending for raw data sizes.

      CCDB topics.

      AliECS related topics:

      • Improve error messages in the AliECS GUI for EPN-related failures. PDP error messages are sent via ODC in the Run reply, e.g. for topology generation failures, but ECS does not show them, only the generic "EPN Partition Initialize Failed". https://alice.its.cern.ch/jira/browse/OCTRL-734
        • ODC / topology generation error messages are now shown in the ECS GUI, though the GUI is a bit ugly and the text is somewhat convoluted with other content. Vasco is aware, and they will clean it up in the next releases.

      GPU ROCm / compiler topics:

      • Changed the build system to also allow binaries for different CUDA architectures in the same build.
      • The regression in ROCm 5.4.3 is still not understood; reported to AMD, but the current MI100 problem has priority.
      • Found a new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
      • Found a new miscompilation in the looper following with -ffast-math enabled; -ffast-math is disabled for now (a general illustration of the -ffast-math pitfalls follows after this list).
      • Must create a new minimal reproducer for the compile error that appears when we enable the LOG(...) functionality in the HIP code, to check whether this is a bug in our code or in ROCm. Lubos will work on this.
      • Another compiler problem with template handling, found by Ruben. We have a workaround for now; need to create a minimal reproducer and file a bug report.
      • While debugging the calibration, debug output triggered another internal compiler error in the HIP compiler. No problem for now, since it happened only with temporary debug code, but we should still report it to AMD to get it fixed.
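
      As a general illustration of why -ffast-math is delicate in tracking code (separate from the looper-following miscompilation above, which is a genuine compiler bug): -ffast-math implies -ffinite-math-only, so NaN guards may be compiled away. The function below is hypothetical:

        // Well-known -ffast-math pitfall: under -ffinite-math-only, gcc/clang
        // may assume isnan() is always false and drop the guard entirely,
        // letting a NaN propagate silently through the fit.
        #include <cmath>
        #include <cstdio>

        float acceptChi2(float chi2) // hypothetical fit-quality check
        {
          if (std::isnan(chi2)) {
            return 1e30f; // reject degenerate candidate; may be optimized away
          }
          return chi2;
        }

        int main()
        {
          float nan = std::sqrt(-1.0f);
          // Prints 1e+30 normally, but may print nan when built with -ffast-math.
          std::printf("%g\n", acceptChi2(nan));
          return 0;
        }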

      TPC GPU Processing

      • Random GPU crashes under investigation.
      • Investigating GPU tracking with distortion corrections. Something is clearly wrong already during the seeding: single high-pT tracks are sometimes not found, or found only partially.

      ANS Encoding

      • Michael is working on the PR (discussion in JIRA); looks good so far. Hopefully a first version by the end of the week.

      Issues currently lacking manpower, waiting for a volunteer:

      • For debugging, it would be convenient to have a proper tool that (using the FairMQ debug mode) can list all messages currently in the SHM segments, similar to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
      • Redo / improve the parameter range scan for tuning GPU parameters. In particular on the AMD GPUs, which seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
    • 11:20 → 11:25
      TRD Tracking 5m
      Speaker: Ole Schmidt (CERN)
    • 11:25 → 11:30
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
      • Currently: developing road finding (traverse the tree of neighbour cells and create all candidate topologies)
        • Traversing the tree is a recursive function; there is no easy prediction of the number of combinations.
        • Operations are done on indices (cells); the number of candidates is already reduced at this point.
        • Dry run to fill the index table, then rerun with known starting points (a minimal sketch of this two-pass scheme follows below).
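
      A minimal sketch of that dry-run / fill scheme (the generic count-then-populate two-pass pattern with a prefix sum; illustrative only, not the actual ITS tracking code, and the traversal here is a dummy stand-in):

        // Generic two-pass "count, scan, fill" sketch (not the ITS code itself):
        // pass 1 runs the traversal only to count candidates per seed, an
        // exclusive prefix sum turns counts into output offsets, and pass 2
        // reruns the traversal writing into a table of exactly the right size.
        // This sidesteps the unpredictable output size of recursive traversal.
        #include <numeric>
        #include <vector>

        // Hypothetical traversal: emits the candidate roads reachable from `seed`.
        template <class F>
        void traverse(int seed, F&& emit)
        {
          for (int c = 0; c < seed % 3; ++c) { // dummy stand-in for the tree walk
            emit(100 * seed + c);
          }
        }

        int main()
        {
          const int nSeeds = 8;
          std::vector<int> offsets(nSeeds + 1, 0);
          for (int s = 0; s < nSeeds; ++s) { // pass 1: dry run, count only
            traverse(s, [&](int) { ++offsets[s]; });
          }
          std::exclusive_scan(offsets.begin(), offsets.end(), offsets.begin(), 0);
          std::vector<int> roads(offsets[nSeeds]); // exact size known up front
          for (int s = 0; s < nSeeds; ++s) { // pass 2: rerun at known offsets
            int pos = offsets[s];
            traverse(s, [&](int road) { roads[pos++] = road; });
          }
          return 0;
        }

      On the GPU the two passes map naturally onto one thread per seed, with the prefix sum done by a standard scan primitive.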
           
      • PRs pending:
      • CHEP plans:
        • Road finding can be a realistic and reasonable point to arrive at.
        • Work out which plots can reasonably be approved by that time.
        • Overall I see no timing performance regression w.r.t. the CPU (optimisation not yet discussed), nor exaggerated speedups; there are features which can still be valuable (automatic code generation for HIP, intra-TF flexibility, memory size configurability).
           
      • TL;DR: ITS GPU Vertexer (TimeFrame "splitting")
         

      Step               | on GPU (CUDA/HIP) | CPU consistency | Speedup | Optimised | Comments/Improvements
      Tracklet Finder    | ✅                | ✅              | ~       | 🚧        |
      Tracklet Selection | ✅                | ✅              | ~       | 🚧        |
      Cell finder        | /                 | /               | /       |           | Investigate after tracking

      • TL;DR: ITS GPU Tracker (TimeFrame "splitting")
         

      Step                  | on GPU (CUDA/HIP) | CPU consistency | Speedup | Optimised | Comments/Improvements
      Tracklet Finder       | ✅                | ✅ (δ~1‰)       | ~       | 🚧        | multi-ROF sliding window
      Trkl duplicate finder | ✅                | ✅              | ~       | ✅        | thrust & CUB mainly
      Cell finder           | 🚧                |                 | ~       | 🚧        |
      Cell neighbour finder | 🚧                |                 | ~       | 🚧        |
      Road finder           | 🚧                | 🚧              | ~       | 🚧        |
      Track fitting         | 🚧                | 🚧              |         |           |

    • 11:30 → 11:35
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (Heidelberg University (DE))

      - The ideal clusterizer is done. The center of gravity is calculated on the fly for digits which belong to the same cluster. Maxima and centers of gravity are calculated for all digits with the same MC label and within a (time, pad) window of (16, 6), to find single maxima for looper tracks (a minimal sketch follows below).
      - Post-processing for the neural network: remove points where (maximum_pad - CoG_pad + maximum_time - CoG_time) > 2. This is done to not confuse the network (outlier removal), and the cut anyway captures most of the distribution of points.
      - Evaluating the efficiency (total maxima found by the clusterizer / total maxima found by the ideal clusterizer), the fake rate (how many maxima were found by the clusterizer that are not found by the ideal clusterizer) and the clone rate (how many maxima found by the ideal clusterizer have more than one matching maximum from the real clusterizer) is important and ongoing.
      - Maximum finding in 3D might improve results; needs some thought on how to pass values to the network (pad offsets and crossing sectors / ROCs).
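
      A minimal sketch of the on-the-fly CoG and the outlier cut (field names are hypothetical; the real clusterizer additionally groups by MC label and applies the (16, 6) window, and this sketch assumes the pad/time distances in the cut are taken as absolute values):

        // Charge-weighted center of gravity over the digits of one ideal
        // cluster, plus the outlier cut described above. Field names are
        // hypothetical; the |dpad| + |dtime| reading of the cut is an assumption.
        #include <cmath>
        #include <cstdio>
        #include <vector>

        struct Digit { float pad, time, charge; };
        struct CoG { float pad, time; };

        CoG centerOfGravity(const std::vector<Digit>& cluster)
        {
          float q = 0, p = 0, t = 0;
          for (const Digit& d : cluster) { // charge-weighted mean in (pad, time)
            q += d.charge;
            p += d.pad * d.charge;
            t += d.time * d.charge;
          }
          return {p / q, t / q};
        }

        // Outlier removal for the training sample: drop clusters whose maximum
        // lies too far from the CoG, i.e. |dpad| + |dtime| > 2.
        bool keepForTraining(const Digit& maximum, const CoG& cog)
        {
          return std::fabs(maximum.pad - cog.pad) + std::fabs(maximum.time - cog.time) <= 2.0f;
        }

        int main()
        {
          std::vector<Digit> cluster{{10, 100, 5}, {11, 100, 10}, {11, 101, 5}};
          CoG cog = centerOfGravity(cluster);
          std::printf("CoG pad=%.2f time=%.2f keep=%d\n", cog.pad, cog.time,
                      keepForTraining(cluster[1], cog));
          return 0;
        }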