Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 AM 10:20 AM
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias; need to check if it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune existing 16-core settings and add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8-core.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Need to check ROCm 7.2 correctness.
      • Need to understand and fix crash on RTX Pro 6000.
      • Understand deterministic mode issue on NVIDIA Blackwell.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Maps now yielding correct results, but 1.5x performance regression running on GPUs.
        • Could not find the cause yet (also did not have much time so far). The difficulty is that I cannot bisect it, since intermediate commits do not run on GPU. Probably have to rewrite the PR and split it into steps that can run on GPU, to see what triggers the slowdown.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • TPC decoding can now handle data that was encoded with the wrong B field; the memory scaling factor parameters are configurable via --configKeyValue.

       

      Other topics:

      • Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
      • Status of bumping CMake and boost (https://github.com/alisw/alidist/pull/6135):
        • CI fully green now. Need to check on EPNs, and have to do a special build for analyzers to check, in particular the bump of CGAL.
      • Test at NERSC still ongoing; all jobs so far failed for non-GPU-related reasons (issues with the farm, no free slots for several days, wrong JDL, old JAlien version, ...)

       

      EPN GPU Topics:

       

    • 10:20 AM 10:25 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster error tuning

       

      • Automated search (choose half widths around optimum and submit next iteration)
      • Adjusted search metric: Correctly attached non-fake * average efficiency (primaries)
        • Efficiency improved slightly
      • Adjusted network input: All cluster flags, invCharge, invAvgCharge, mult
        • That didn't change much...
      • Added new scaling parameters for chi cuts in tracking

        • First search: scaleChiY1, scaleChiZ1, nnScaleClusterErrorY, nnScaleClusterErrorZ

       

      • Next step: Need new (more realistic) simulation
      • Tighten cuts on training data further to have a better cluster error estimate
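      The automated search described above (scan a window around the current optimum, then shrink the window and submit the next iteration) can be sketched roughly as follows. This is only an illustration: the function `refine_search`, the grid size, the shrink factor, and the toy metric are assumptions and do not represent the actual tuning setup or the real search metric.

```python
from itertools import product


def refine_search(objective, center, half_widths, n_points=5, n_iterations=6, shrink=0.5):
    """Maximize `objective` by repeatedly scanning a grid around the best point so far."""
    best_point = tuple(center)
    best_value = objective(best_point)
    widths = list(half_widths)
    for _ in range(n_iterations):
        # One evenly spaced axis per parameter, centered on the current optimum.
        axes = [
            [c - w + 2.0 * w * i / (n_points - 1) for i in range(n_points)]
            for c, w in zip(best_point, widths)
        ]
        for point in product(*axes):
            value = objective(point)
            if value > best_value:
                best_point, best_value = point, value
        # Narrow the window before the next iteration.
        widths = [w * shrink for w in widths]
    return best_point, best_value


def toy_metric(point):
    # Hypothetical stand-in for the real search metric, with its maximum at (1.2, 0.8).
    y_scale, z_scale = point
    return -((y_scale - 1.2) ** 2 + (z_scale - 0.8) ** 2)


best, value = refine_search(toy_metric, center=(1.0, 1.0), half_widths=(0.5, 0.5))
print(best, value)
```

      In a real workflow each grid point would correspond to one submitted job, so the number of grid points per iteration is a trade-off between resolution and the number of jobs.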
    • 10:25 AM 10:30 AM
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      GPU Parameters news

      Tuned MI210 without RTC


      Event 0: 4.24 s --> 3.15 s (-25.7 %)

      Event 1: 4.60 s --> 3.70 s (-19.6 %)

      Event 2: 4.77 s --> 3.77 s (-21.0 %)
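
      The quoted improvements follow directly from the before/after timings. A minimal sketch to recompute them (the names `timings` and `relative_change_percent` are illustrative; the numbers are the ones listed above):

```python
# Timings in seconds (before tuning, after tuning), copied from the list above.
timings = {
    "Event 0": (4.24, 3.15),
    "Event 1": (4.60, 3.70),
    "Event 2": (4.77, 3.77),
}


def relative_change_percent(before: float, after: float) -> float:
    """Relative change of `after` with respect to `before`, in percent."""
    return (after - before) / before * 100.0


changes = {name: relative_change_percent(b, a) for name, (b, a) in timings.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.1f} %")
```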

      Google doc

    • 10:30 AM 10:35 AM
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)
       

      NextGenTrigger Task 1.7

      • CHEP talk together with Jolly got accepted.
      • Presented a "proof of concept" to improve the CMS clustering algorithm CLUE to Felice and NGT.
        • They liked it and they want to work it out.

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • Worked on pull-request for O2.
      • Tried also Gabriele's benchmarking tool.
        • See appended CSV file or below for a comparison to standalone benchmark.
        • Made it part of CI-pipeline that benchmarks on NGT GPUs.
      • Next Steps:
        • Write minimal reproducer for the (weird) behavior I observed with AMD W7900 and custom .par file.
        • Make better use of SoA to improve performance
        • Try David's suggestion

      Comparison Standalone / rocprofv2

      name                       time (standalone)  time (rocprofv2)   abs. relative difference
      GPUTPCNeighboursCleaner    4952.69666666667   3891.71675         0.214222672631543
      GPUTPCNeighboursFinder     37099.65           35538.7530833333   0.042073090087553
      GPUTPCStartHitsFinder      1509.4375          725.102916666667   0.519620443597919
      GPUTPCStartHitsSorter      3609.17333333333   2646.44583333333   0.266744600795011
      GPUTPCTrackletConstructor  1225978.33333333   1224653.22916667   0.001080854474046
      GPUTPCTrackletSelector     36015.6333333333   34149.2568333333   0.051821287792616
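
      The difference column in the comparison above is reproduced by |standalone - rocprofv2| / standalone, i.e. it is a difference relative to the standalone time. A short sketch that recomputes it from the listed times (the dictionary and function names are illustrative):

```python
# Kernel times copied from the comparison above: (standalone, rocprofv2).
kernel_times = {
    "GPUTPCNeighboursCleaner": (4952.69666666667, 3891.71675),
    "GPUTPCNeighboursFinder": (37099.65, 35538.7530833333),
    "GPUTPCStartHitsFinder": (1509.4375, 725.102916666667),
    "GPUTPCStartHitsSorter": (3609.17333333333, 2646.44583333333),
    "GPUTPCTrackletConstructor": (1225978.33333333, 1224653.22916667),
    "GPUTPCTrackletSelector": (36015.6333333333, 34149.2568333333),
}


def relative_difference(reference: float, other: float) -> float:
    """Absolute difference between the two times, relative to the reference time."""
    return abs(reference - other) / reference


rel_diff = {
    name: relative_difference(standalone, rocprof)
    for name, (standalone, rocprof) in kernel_times.items()
}

# Print the kernels with the largest disagreement first.
for name, value in sorted(rel_diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {value:7.2%}")
```

      The largest disagreements are for the short kernels (GPUTPCStartHitsFinder, GPUTPCStartHitsSorter, GPUTPCNeighboursCleaner), while the dominant GPUTPCTrackletConstructor agrees to about 0.1 %.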
    • 10:35 AM 10:40 AM
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 AM 10:45 AM
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      CI Server: No update from Sergio yet.

      Highly Ionizing Particles

      Opened PR for the current implementation.

      Filter Performance

       

      Reran filter on new simulation data with Mesut's suggestions. (brackets: old values)

      Good: 

      • False positive rate went down.

      Bad:

      • Still only about 40% of tails filtered.

      Ugly:

      • Even with a threshold of 5 ADC, filter performance doesn't change meaningfully.

       

      Confusion matrix

      Cutoff Threshold  Correctly Zeroed  Tails Leaked     Correctly Kept   False Positive   Precision (# correctly zeroed / # total zeroed)
      150 ADC           0.4142 (0.3910)   0.5858 (0.6090)  0.9987 (0.9977)  0.0013 (0.0023)  0.6513 (0.5370)
      50 ADC            0.4301 (0.4115)   0.5699 (0.5885)  0.9987 (0.9974)  0.0013 (0.0026)  0.6597 (0.5161)
      5 ADC             0.4782            0.5218           0.9986           0.0014           0.6719
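
      In the numbers above, "Correctly Zeroed" + "Tails Leaked" and "Correctly Kept" + "False Positive" each sum to 1, which suggests the rates are normalized per true class, while the precision is normalized to everything the filter zeroed. Under that assumption, the columns could be derived from raw counts as sketched below. The class `FilterCounts`, the function `filter_rates`, and the example counts are hypothetical and not taken from the actual evaluation.

```python
from dataclasses import dataclass


@dataclass
class FilterCounts:
    tails_zeroed: int   # tail samples removed by the filter (true positives)
    tails_leaked: int   # tail samples that survived the filter (false negatives)
    signal_kept: int    # non-tail samples kept (true negatives)
    signal_zeroed: int  # non-tail samples removed by mistake (false positives)


def filter_rates(c: FilterCounts) -> dict:
    """Per-class rates and precision, using the column names of the table."""
    n_tails = c.tails_zeroed + c.tails_leaked
    n_signal = c.signal_kept + c.signal_zeroed
    n_zeroed = c.tails_zeroed + c.signal_zeroed
    return {
        "correctly_zeroed": c.tails_zeroed / n_tails,
        "tails_leaked": c.tails_leaked / n_tails,
        "correctly_kept": c.signal_kept / n_signal,
        "false_positive": c.signal_zeroed / n_signal,
        "precision": c.tails_zeroed / n_zeroed,
    }


# Hypothetical counts, chosen only to illustrate the calculation.
example = FilterCounts(tails_zeroed=400, tails_leaked=600, signal_kept=99800, signal_zeroed=200)
rates = filter_rates(example)
for name, value in rates.items():
    print(f"{name:17s} {value:.4f}")
```

      With these definitions a small false positive rate can still cost a lot of precision when non-tail samples greatly outnumber tail samples, since the precision depends on the class balance while the per-class rates do not.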

      Saturated Clusters

      • First proof-of-concept implementation in place.
      • Still need to evaluate the resulting clusters.
      • Need to find a way to parallelize the implementation.
       
    • 10:45 AM 10:50 AM
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Gabriele:

      New implementation:

      Precision: 0.9928, Recall: 0.9133, F1: 0.9514
      Total true vertices: 150
      Matched vertices: 137
      Missed true vertices: 13
      Total fitted vertices: 138
      Fake vertices: 1
      Mean signed residuals (x,y,z): [0.00038846 0.00010329 0.00066991]
      Std  signed residuals (x,y,z): [0.00853282 0.00977771 0.0052135 ]
      Mean absolute residuals (x,y,z): [0.0043629  0.00525042 0.00273546]
      Std  absolute residuals (x,y,z): [0.00733391 0.00823679 0.00448273]

      Missed true vertices:
        ROF         x         y         z  nContributors
       11.0 -0.000259  0.022460  -3.69227           10.0
       13.0 -0.000585  0.002898   8.07200           11.0
       30.0 -0.024436 -0.009381   7.03565            1.0
       35.0 -0.023708 -0.009344  -5.51584            1.0
       48.0 -0.010583 -0.008603  -3.05986           14.0
       75.0  0.011355 -0.002970   1.90074            1.0
       90.0  0.002929 -0.018336   2.66971            1.0
      102.0  0.001209  0.020929  -4.71790            1.0
      123.0 -0.001262 -0.002698   8.83149           22.0
      128.0  0.006334 -0.001214  -5.15652           10.0
      158.0 -0.022409 -0.003296  -7.26120            3.0
      161.0  0.002261  0.001900 -10.54620            1.0
      164.0 -0.008004  0.000373  -5.64786            8.0

      Old implementation:

      Precision: 1.0000, Recall: 0.9067, F1: 0.9510
      Total true vertices: 150
      Matched vertices: 136
      Missed true vertices: 14
      Total fitted vertices: 136
      Fake vertices: 0
      Mean signed residuals (x,y,z): [-7.04813868e-05  1.47534449e-04  3.28789706e-04]
      Std  signed residuals (x,y,z): [0.00157584 0.00351721 0.00311071]
      Mean absolute residuals (x,y,z): [0.00092318 0.00105691 0.00067765]
      Std  absolute residuals (x,y,z): [0.00127659 0.00335669 0.00305332]

      Missed true vertices:
        ROF         x         y         z  nContributors
       11.0 -0.000259  0.022460  -3.69227           10.0
       30.0 -0.024436 -0.009381   7.03565            1.0
       35.0 -0.023708 -0.009344  -5.51584            1.0
       48.0 -0.010583 -0.008603  -3.05986           14.0
       49.0 -0.011022  0.000661  -3.79249           39.0
       75.0  0.011355 -0.002970   1.90074            1.0
       90.0  0.002929 -0.018336   2.66971            1.0
       97.0  0.011611  0.000815   0.00732           28.0
      102.0  0.001209  0.020929  -4.71790            1.0
      123.0 -0.001262 -0.002698   8.83149           22.0
      128.0  0.006334 -0.001214  -5.15652           10.0
      158.0 -0.022409 -0.003296  -7.26120            3.0
      161.0  0.002261  0.001900 -10.54620            1.0
      164.0 -0.008004  0.000373  -5.64786            8.0
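
      The precision, recall, and F1 values quoted for both implementations are consistent with precision = matched / fitted and recall = matched / true. A small sketch that recomputes them from the vertex counts above (the function name `vertex_metrics` is illustrative, and the definitions are inferred from the numbers, not taken from the analysis script):

```python
def vertex_metrics(n_true: int, n_fitted: int, n_matched: int):
    """Return (precision, recall, F1) from vertex counts."""
    precision = n_matched / n_fitted  # fraction of fitted vertices that match a true one
    recall = n_matched / n_true       # fraction of true vertices that were found
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the two summaries above.
new_impl = vertex_metrics(n_true=150, n_fitted=138, n_matched=137)
old_impl = vertex_metrics(n_true=150, n_fitted=136, n_matched=136)

for label, (p, r, f1) in (("new", new_impl), ("old", old_impl)):
    print(f"{label}: Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}")
```

      The two implementations differ by a single additional matched vertex and a single fake vertex, so the F1 scores are practically identical; the visible difference is in the residuals.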

       

      Next steps:

      • Use different trackleting and improve fitting to reduce residuals
      • Attempt another approach suggested by Felix still using Ray Tracing techniques
        • Move from sampling of line inside voxels to voxel traversing
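
      For the last point, one common way to replace sampling along a line by exact voxel traversal is a 3D DDA in the style of Amanatides and Woo. The sketch below is a generic CPU illustration in Python and only an assumption about what such an approach could look like; it is not the planned ITS implementation, and all names and the uniform-grid layout are hypothetical.

```python
import math


def traverse_voxels(origin, direction, grid_min, voxel_size, grid_shape, t_max=math.inf):
    """Yield, in order, the indices of all voxels of a uniform grid crossed by a ray.

    The ray is origin + t * direction for 0 <= t <= t_max; the origin is assumed
    to lie inside the grid.
    """
    if not any(direction):
        raise ValueError("direction must be non-zero")
    # Voxel that contains the ray origin.
    index = [int(math.floor((o - g) / voxel_size)) for o, g in zip(origin, grid_min)]
    step, t_next, t_delta = [], [], []
    for axis in range(3):
        d = direction[axis]
        if d > 0:
            boundary = grid_min[axis] + (index[axis] + 1) * voxel_size
            step.append(1)
            t_next.append((boundary - origin[axis]) / d)  # t at the next boundary
            t_delta.append(voxel_size / d)                # t needed to cross one voxel
        elif d < 0:
            boundary = grid_min[axis] + index[axis] * voxel_size
            step.append(-1)
            t_next.append((boundary - origin[axis]) / d)
            t_delta.append(-voxel_size / d)
        else:
            step.append(0)
            t_next.append(math.inf)  # the ray never crosses a boundary on this axis
            t_delta.append(math.inf)
    while all(0 <= index[a] < grid_shape[a] for a in range(3)):
        yield tuple(index)
        # Advance along the axis whose next boundary is closest.
        axis = min(range(3), key=lambda a: t_next[a])
        if t_next[axis] > t_max:
            break
        index[axis] += step[axis]
        t_next[axis] += t_delta[axis]


straight = list(traverse_voxels((0.5, 0.5, 0.5), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0, (4, 4, 4)))
slanted = list(traverse_voxels((0.1, 0.6, 0.5), (1.0, 0.5, 0.0), (0.0, 0.0, 0.0), 1.0, (3, 2, 1)))
print(straight)
print(slanted)
```

      Compared with sampling points along the line, this visits every crossed voxel exactly once and cannot skip a voxel that the line only clips at a corner.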
    • 10:50 AM 10:55 AM
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)