Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 AM 10:20 AM
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune existing 16-core settings, add a SITEARCH for 16-core CPU, and 16-core CPU + generic NVIDIA / AMD GPU, as for 8-core.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Need to check ROCm 7.2 correctness.
      • Need to understand and fix crash on RTX Pro 6000.
      • Understand deterministic mode issue on NVIDIA Blackwell.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Maps now yielding correct results, but 1.5x performance regression running on GPUs.
        • This happens on all GPUs, but not on the CPU. Register / shared / local memory usage of the TrackFit kernel (which is heavily affected) is the same with and without the PR. It is not yet clear what the problem is.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • TPC decoding can now handle data that was encoded with a wrong B field; the memory scaling factor parameters are configurable as --configKeyValue.

       

      Other topics:

      • Need to bump ONNXRuntime to 1.24, Giulio is checking, needed for ROCm 7.2 - Status?
      • Status of bumping CMake and boost (https://github.com/alisw/alidist/pull/6135):
        • All known issues fixed, but there is a new issue compiling Geant4 on macOS (this was working a few days ago). Need to check.

       

      EPN GPU Topics:

       

    • 10:20 AM 10:25 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster error parameterization

       

      • Continued with grid searches + interpolation:
        • New procedure for finding minima:
          • Run approximate grid ranges (parallelizable)
          • Fit small NN to surface (interpolation, 2D)
          • Use gradient descent to find extremum
          • Sample Latin hypercube grid around extremum
          • Iterate until satisfied (rough sketch below)
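
      A rough, hedged sketch of this loop (illustrative Python; the objective function, parameter ranges, and library choices are assumptions, not the actual analysis code):

      import numpy as np
      from scipy.stats import qmc
      from scipy.optimize import minimize
      from sklearn.neural_network import MLPRegressor

      def objective(scale_y, scale_z):
          # Placeholder: in the real study this would be e.g. the number of
          # correctly attached clusters from a full reconstruction + QA pass.
          return -((scale_y - 0.18) ** 2 + 0.1 * (scale_z - 2.9) ** 2)

      # 1. Coarse grid over approximate ranges (each point is an independent job).
      ys, zs = np.meshgrid(np.linspace(0.05, 0.5, 8), np.linspace(1.0, 6.0, 8))
      pts = np.column_stack([ys.ravel(), zs.ravel()])
      vals = np.array([objective(y, z) for y, z in pts])

      for _ in range(3):  # iterate until satisfied
          # 2. Fit a small NN to the sampled surface (2D interpolation).
          surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000).fit(pts, vals)
          # 3. Gradient-based descent on the surrogate to locate the extremum
          #    (maximum -> minimize the negative prediction).
          start = pts[np.argmax(vals)]
          res = minimize(lambda p: -surrogate.predict(p.reshape(1, -1))[0], start, method="L-BFGS-B")
          center = res.x
          # 4. Latin-hypercube sample a small box around the predicted extremum.
          box = qmc.scale(qmc.LatinHypercube(d=2).random(16), center - [0.05, 0.5], center + [0.05, 0.5])
          pts = np.vstack([pts, box])
          vals = np.concatenate([vals, [objective(y, z) for y, z in box]])

      print("Best scaleY/scaleZ found:", pts[np.argmax(vals)])

      Each evaluation of the objective corresponds to a full reconstruction and QA pass, which is why the grid stage is the expensive, parallelizable part.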

       

      1. NN trained with mC[0] < 0.5 && mC[2] < 0.5 && abs(dy) < 2 && abs(dz) < 2, optimization strategy: maximize the number of correctly attached clusters

      Plot (see slides): number of correctly attached clusters as a function of the scale parameters, interpolated by the NN; the Latin hypercube (LHS) samples around the predicted minimum are visible on top of the regular grid.

      Using config close to minimum

      Efficiency worse, but better fake and clone rates, and much longer tracks compared to the default reco -> fewer clones?!
      Finds fewer tracks than the default reco, but for NCl > 60 it finds more tracks than the default reco

       

      2. NN trained with mC[0] < 0.25 && mC[2] < 0.25 && abs(dy) < 1.5 && abs(dz) < 1.5, optimization strategy: maximize the number of correctly attached clusters

      • Similar picture
      • Correctly attached clusters: default reco 18.06 million, closest NN config 18.39 million.

       

      3. NN trained with mC[0] < 0.25 && mC[2] < 0.25 && abs(dy) < 1.5 && abs(dz) < 1.5, Change of optimization strategy: Optimize for "correctly attached non-fake clusters"

      • Best configurations:
        • y = 0.252811, z = 2.21427 (correctly attached opt. from previous bullet point)
        • y = 0.177332, z = 2.91026 (correctly attached, non-fake opt.)
      • The best configurations clearly differ between the two optimization strategies...

       

      • Seems to work a lot better at reducing fake rates
      • NCl peak behaviour at 120 much closer to current reco.
      • Significantly reduced short tracks

       

      4. NN trained with mC[0] < 0.25 && mC[2] < 0.25 && abs(dy) < 1.5 && abs(dz) < 1.5, Change of optimization strategy: Maximize "correctly attached non-fake clusters - fake attached clusters"

       

      Using the same data as above:

      This means the optimization objective has shifted towards a higher scaleZ and a lower scaleY factor than previously!

      Extend the grid.

      Take one direction through phase-space (plots vs. local Y and vs. pT; see slides):

      Similar pictures vs. eta, phi and Z -> it's not a random spike somewhere, but rather a lower efficiency, clone and fake rate across the whole phase-space

       

      Effect of increasing scaleZ parameter:

      • Lower efficiency
      • Lower fake rate
      • Lower clone rate

       

      Now the other direction:

      Effect of increasing scaleY parameter:

      • Lower efficiency
      • Higher fake-rate
      • Lower clone-rate

       

      Lessons learned:

      • Fake and clone rate can be optimised with the shown optimization strategies
      • Efficiency is not well optimized with the metrics used -> new metric needed?! What to take?

       

      Best configurations overview:

      • y = 0.252811, z = 2.21427 (correctly attached, tighter cuts)
      • y = 0.177332, z = 2.91026 (correctly attached non-fake)
      • y = 0.148339, z = 4.30489 (correctly attached non-fake - fake attached)
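
      For clarity on how the three objectives above differ, a small illustrative sketch computing them from per-cluster MC truth flags (the arrays and names are hypothetical placeholders, not the GPUQA interface):

      import numpy as np

      # Hypothetical per-attached-cluster truth flags (placeholders for MC-based QA output).
      rng = np.random.default_rng(0)
      n_attached = 1_000_000
      correct = rng.random(n_attached) < 0.9        # attached to the MC track it belongs to
      on_fake_track = rng.random(n_attached) < 0.1  # attached to a track flagged as fake

      correctly_attached = np.count_nonzero(correct)
      correctly_attached_non_fake = np.count_nonzero(correct & ~on_fake_track)
      fake_attached = np.count_nonzero(~correct)

      # The three optimization objectives compared above:
      obj_correct = correctly_attached                                        # strategies 1./2.
      obj_non_fake = correctly_attached_non_fake                              # strategy 3.
      obj_non_fake_minus_fake = correctly_attached_non_fake - fake_attached   # strategy 4.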

       

      One more study: Effect of network size

      Use networks:

      1. 2 layers, 32 neurons per layer
      2. 4 layers, 32 NpL
      3. 8 layers, 32 NpL
      4. 16 layers, 32 NpL (all cases from studies above)

       

      (Plots for networks 1.-4.: see slides.)

      -> Scale factors need to be tuned individually per network and might be somewhat sensitive to the network's ability to fit the underlying data

      1. scaleY = 0.104, scaleZ = 3.96
      2. scaleY = 0.124, scaleZ = 2.776
      3. scaleY = 0.171, scaleZ = 2.30
      4. scaleY = 0.141, scaleZ = 6.45 (and previous: scaleY = 0.148, scaleZ = 4.30)

       

      Lesson learnt:

      • ScaleZ (= scaling in time / Z direction) is quite volatile...

       

      Caveats and further development:

      • This is now tuned with a dataset from centrality-enforced 38 kHz simulations (LHC24ar apass2 anchored) -> maybe switch to a dataset of lower occupancy or combine different datasets
      • The optimization metric (correctly attached clusters) is only available in MC; a metric usable for tuning on real data would be preferable
      • To be seen how well this extrapolates to real data (future development)

       

       

      Final Q:

      • In GPUQA: "Correctly attached clusters" and "Correctly Attached all-trk normalized" are not identical

    • 10:25 AM 10:30 AM
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      RTC compilation bugged for ROCm 7.2?

      ROCR_VISIBLE_DEVICES=1 ./ca -e lhc24ar_raw --gpuType HIP --memSize 15000000000 --inputMemory 6000000000 --outputMemory 10000000000 --sync -s 0 -n 1 --runs 10 --PROCdoublePipeline --RTCenable
      ...
      RTC Compilation finished (38.868433 seconds)
      HIP Initialisation successfull (Device 0: AMD Instinct MI60 / MI50 (Frequency 1725000, Cores 60), 1073741824 / 15000000000 bytes host / global memory, Stack frame 8192, Constant memory 25831)
      GPU Tracker initialization successfull
      HIP Initialisation successfull (from master)
      Rescaling buffer size limits from 20500000000 to 15000000000 bytes of memory (factor 0.731707)
      Rescaling buffer size limits from 20500000000 to 14504613376 bytes of memory (factor 0.707542)
      Using random seed 1819016860
      Loading time: 317,623 us
      Processing Event 0 in Pipeline 10 times
      Run 1 (thread 0)
      Event has 227301 8kb TPC ZS pages (version 4), 774406950 digits
      Event has 81586869 TPC Clusters, 0 TRD Tracklets
      Memory access fault by GPU node-3 (Agent handle: 0x485b2c0) on address 0x7f7023c00000. Reason: Page not present or supervisor privilege.

      ROCR_VISIBLE_DEVICES=1 ./ca -e lhc24ar_raw --gpuType HIP --memSize 15000000000 --inputMemory 6000000000 --outputMemory 10000000000 --sync -s 0 -n 1 --runs 10 --PROCdoublePipeline
      Warning in <UnknownClass::SetDisplay>: DISPLAY not set, setting it to 10.162.32.11:0.0
      Reading events from Directory lhc24ar_raw
      Using 6000000000 bytes as input memory
      Using 10000000000 bytes as output memory
      GPU Tracker library loaded and GPU tracker object created sucessfully
      Created GPUReconstruction instance for device type HIP (3)
      Created GPUReconstruction instance for device type HIP (3) (slave)
      Read event settings from dir events/lhc24ar_raw/ (solenoidBz: 5.006668, constBz 0, maxTimeBin 14257)
      Standalone Test Framework for CA Tracker - Using GPU
      HIP Initialisation successfull (Device 0: AMD Instinct MI60 / MI50 (Frequency 1725000, Cores 60), 1073741824 / 15000000000 bytes host / global memory, Stack frame 8192, Constant memory 25831)
      GPU Tracker initialization successfull
      HIP Initialisation successfull (from master)
      Rescaling buffer size limits from 20500000000 to 15000000000 bytes of memory (factor 0.731707)
      Rescaling buffer size limits from 20500000000 to 14504613376 bytes of memory (factor 0.707542)
      Using random seed -1960746364
      Loading time: 317,138 us
      Processing Event 0 in Pipeline 10 times
      Run 1 (thread 0)
      Event has 227301 8kb TPC ZS pages (version 4), 774406950 digits
      Event has 81586869 TPC Clusters, 0 TRD Tracklets
      Output Tracks: 677655 (0 / 44628611 / 0 / 81586869 clusters (fitted / attached / adjacent / total) - O2 format)
      Total Wall Time:    7058803 us
      Run 2 (thread 0)
      Event has 227301 8kb TPC ZS pages (version 4), 774406950 digits

    • 10:30 AM 10:35 AM
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)
       

      NextGenTrigger Task 1.7

      • CHEP talk together with Jolly got accepted.
      • Presenting "proof of concept" for a variation in CMS clustering algorithm CLUE tomorrow.

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • Created a pull-request for O2 with the following changes:
        • With the argument --PROCtimingCSV <filename> we write the output of --debug <n> to a CSV file.
        • To avoid code duplication, I also changed the way we print to the terminal (see next section).
        • The argument --PROCresetTimers was always overwritten by --runsInit, changed that.
      • Also tried Gabriele's benchmarking tool; it worked out of the box.
      • Next Steps:
        • Use the same settings and compare with Gabriele's benchmarking tool.
        • Write minimal reproducer for the (weird) behavior I observed with AMD W7900 and custom .par file.
        • Make better use of SoA to improve performance
        • Try David's suggestion

      Example of New Terminal Output

      |   |  count | name                                      |  gpu (us) |  cpu (us) |
      |---|--------|-------------------------------------------|-----------|-----------|
      | K |    405 | GPUMemClean16                             |     68380 |           |
      | K |    144 | GPUTPCCFDecodeZSDenseLink                 |     99299 |           |
      | K |     36 | GPUTPCCFCheckPadBaseline                  |     30064 |           |
      | K |    144 | GPUTPCCFPeakFinder                        |     45810 |           |
      | K |    288 | GPUTPCCFStreamCompaction_scanStart        |      6189 |           |
      | K |    288 | GPUTPCCFStreamCompaction_scanUp           |      2883 |           |
      | K |    288 | GPUTPCCFStreamCompaction_scanTop          |      2936 |           |
      | K |    288 | GPUTPCCFStreamCompaction_scanDown         |      2721 |           |

    • 10:35 AM 10:40 AM
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 AM 10:45 AM
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)
       

      OpenCL

      No news.

      GPU Servers

      CI Server: 

      • Setting PCIe version in BIOS fixed Nvidia GPU
      • Handed node to Giulio and Sergio for CI integration
      • Will ping Sergio if he has an update

      Highly Ionizing Particles

      Ernst fixed regression in FST when running simulation.

      Opened PR for the current implementation.

      Filter Performance

       

      Checked filter performance on a 50 kHz PbPb timeframe. Matched zeroed values to MC labels.

      Good news: Filter appears to work.

      Bad news: Doesn't seem to work very well.

      Confusion matrix
       
      | Cutoff Threshold | Correctly Zeroed | Tails Leaked | Correctly Kept | False Positive | Precision (# correctly zeroed / # total zeroed) |
      |------------------|------------------|--------------|----------------|----------------|--------------------------------------------------|
      | 150 ADC          | 0.3910           | 0.6090       | 0.9977         | 0.0023         | 0.5370                                           |
      | 100 ADC          | 0.3959           | 0.6041       | 0.9977         | 0.0023         | 0.5362                                           |
      | 50 ADC           | 0.4115           | 0.5885       | 0.9974         | 0.0026         | 0.5161                                           |
       
       
      Example figures (see slides): original data with yellow regions marking saturated ADC; data overlaid with MC (flood fill around saturated ADC); data with the filter applied.

       

      Runtime Overhead

      • Noisy Pad Filter runs only on first fragment by default -> makes no sense when HIP filter is enabled
      • Total overhead when running with filter on all fragments: 10% walltime

      Other Points

      • Ideas to improve filter performance:
        • Start some time bins backwards from the trigger to handle the rising edge
        • Moving average / exponential filter for the threshold to better handle noise in the tail (rough sketch below)
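
      A minimal illustrative sketch of the exponential-filter idea for the zeroing threshold (Python; function name, parameters, and toy data are assumptions, not the O2 implementation):

      import numpy as np

      def tail_zeroing_mask(adc, threshold=50.0, alpha=0.1):
          """Exponentially smooth the ADC sequence of one pad and flag samples
          whose smoothed value stays above the threshold, so isolated noise in
          the tail does not toggle the zeroing decision."""
          smoothed = np.empty(len(adc))
          acc = float(adc[0])
          for i, value in enumerate(adc):
              acc = alpha * value + (1.0 - alpha) * acc  # exponential moving average
              smoothed[i] = acc
          return smoothed > threshold  # True where the tail should be zeroed

      # Toy pad time series: a saturated pulse followed by a decaying tail.
      adc = np.concatenate([np.full(5, 1023.0), 1023.0 * np.exp(-np.arange(40) / 8.0)])
      adc[tail_zeroing_mask(adc)] = 0.0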
    • 10:45 AM 10:50 AM
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Gabriele:


      Current approach (a rough sketch follows the list):

      1. Define 3D voxelisation around beamline
      2. Parametrize tracks from first three ITS layers as p(t) = p0 + t*u
      3. Compute t_enter and t_exit for when the track enters and exits the 3D grid
      4. Sample position of the track inside voxelisation at uniform Δt
      5. Update the hit count every time a sample hits a box inside the voxelisation
      6. Select box with maximum hit count --> high multiplicity candidate
      7. Select lines passing through that voxel and perform fit (with outlier rejection)
      8. Vertex fit determines beam position
      9. For each line not used for high multiplicity fit, compute z coordinate of closest approach of line to the beam
      10. Compute 1D histogram
      11. Select peaks
      12. Fit lines contributing to peaks
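
      A rough sketch of steps 1-6 (illustrative numpy code; grid dimensions, names, and the slab-intersection details are assumptions, not the actual implementation):

      import numpy as np

      def find_high_mult_voxel(p0, u, nxy=50, nz=1000, r=2.0, zmax=20.0, n_samples=64):
          """Fill a 3D voxel grid around the beamline with samples of the track
          lines p(t) = p0 + t*u and return the voxel with the highest hit count.
          p0, u: (N, 3) arrays holding a point on each line and its direction."""
          counts = np.zeros((nxy, nxy, nz), dtype=np.int32)
          edges_xy = np.linspace(-r, r, nxy + 1)
          edges_z = np.linspace(-zmax, zmax, nz + 1)
          for p, d in zip(p0, u):
              # Entry/exit parameters of the line with the box |x|,|y| < r, |z| < zmax.
              with np.errstate(divide="ignore", invalid="ignore"):
                  t_lo = (np.array([-r, -r, -zmax]) - p) / d
                  t_hi = (np.array([r, r, zmax]) - p) / d
              t_enter = np.max(np.minimum(t_lo, t_hi))
              t_exit = np.min(np.maximum(t_lo, t_hi))
              if not t_exit > t_enter:
                  continue  # line misses the voxelized region
              # Sample the line at uniform dt and increment the crossed voxels
              # (a real implementation would avoid double-counting a voxel per line).
              pts = p + np.outer(np.linspace(t_enter, t_exit, n_samples), d)
              ix = np.digitize(pts[:, 0], edges_xy) - 1
              iy = np.digitize(pts[:, 1], edges_xy) - 1
              iz = np.digitize(pts[:, 2], edges_z) - 1
              ok = (ix >= 0) & (ix < nxy) & (iy >= 0) & (iy < nxy) & (iz >= 0) & (iz < nz)
              np.add.at(counts, (ix[ok], iy[ok], iz[ok]), 1)
          return np.unravel_index(np.argmax(counts), counts.shape)

      The lines passing through the selected voxel would then go into the vertex fit with outlier rejection (steps 7-12).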

       

       

      Tunable parameters:

      1. Nbin in transverse plane [30, 80]
      2. Nbin in beam direction [500, 2000]
      3. local maximum threshold [3, 50]
      4. local maximum prominence [0, 10]

       

      PbPb tf with 150 events:

      Precision: 0.9856, Recall: 0.9133, F1: 0.9481
      Total true vertices: 150
      Matched vertices: 137
      Missed true vertices: 13
      Total fitted vertices: 139
      Fake vertices: 2
      Mean residuals (x,y,z): [-3.20830489e-05 -4.17494588e-04 -2.44645401e-03]
      Std  residuals (x,y,z): [0.00785697 0.01020168 0.01786484] (still need to implement low mult fit)
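
      For reference, the quoted precision/recall/F1 follow directly from the counts above with the standard definitions:

      matched, fake, total_true, total_fitted = 137, 2, 150, 139
      precision = matched / total_fitted                   # 137 / 139 = 0.9856
      recall = matched / total_true                        # 137 / 150 = 0.9133
      f1 = 2 * precision * recall / (precision + recall)   # 0.9481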

      Missed true vertices:
        ROF         x         y         z  nContributors
       11.0 -0.000259  0.022460  -3.69227           10.0
       30.0 -0.024436 -0.009381   7.03565            1.0
       35.0 -0.023708 -0.009344  -5.51584            1.0
       48.0 -0.010583 -0.008603  -3.05986           14.0
       49.0 -0.011022  0.000661  -3.79249           39.0
       75.0  0.011355 -0.002970   1.90074            1.0
       90.0  0.002929 -0.018336   2.66971            1.0
       97.0  0.011611  0.000815   0.00732           28.0
      102.0  0.001209  0.020929  -4.71790            1.0
      123.0 -0.001262 -0.002698   8.83149           22.0
      128.0  0.006334 -0.001214  -5.15652           10.0
      158.0 -0.022409 -0.003296  -7.26120            3.0
      161.0  0.002261  0.001900 -10.54620            1.0

      Last week's metrics:
      Precision: 0.9924, Recall: 0.8667, F1: 0.9253

      CPU performance:

      Old seeding vertexer wall time:

      | Step                           |     ms |
      |--------------------------------|--------|
      | Tracklet finding iteration 0   | 432.61 |
      | Tracklet selection iteration 0 | 297.13 |
      | Vertex finder iteration 0      |  23.91 |
      | Tracklet finding iteration 1   | 148.91 |
      | Tracklet selection iteration 1 |  10.94 |
      | Vertex finder iteration 1      |   0.32 |
      | Total                          | 913.85 |

       

      This implementation:

      | Step                           |     ms |
      |--------------------------------|--------|
      | Tracklet finding iteration 0   | 432.61 |
      | Tracklet selection iteration 0 | 297.13 |
      | Voxel filling                  |    149 |
      | Maximum finding                |     20 |
      | Low mult pass                  |      1 |
      | Total                          | 899.75 |

       

      Next steps:

      1. Implement fit for low mult vertex
      2. When Max provides the cuts, shift to using ITS cells from the tracker rather than from the old vertexer



    • 10:50 AM 10:55 AM
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)