Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 → 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      CHEP Abstracts: https://indico.cern.ch/event/1471803/abstracts/ Deadline Dec. 19.

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
      • New GPU architecture selection for async in O2DPG looks good, should be merged.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run first test with ITS tracking on GPU on EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • No update from AMD, but AMD wrote that they are reorganizing their support process, and we should have a meeting in the coming weeks.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels (a generic alignas() illustration follows this list): https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Sergey opened a new PR with the fixes and compatibility layer in, currently fails in the CI. Must be fixed, then Matthias can continue commissioning.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
        • New tracking is actually faster in the refit, but slower in the looper following. Raw data has more loopers than MC --> faster in MC, slower in raw data overall. Can disable looper following to gain some speed for 50 kHz.
        • TPC laser calib issue fixed; it was a bug in the tracking developments affecting only triggered data.
        • Large memory usage in pp on the GRID - fixed.
        • Gain calib issue was due to new miscompilation.
        • Next iteration of tracking improvements in draft PR: https://github.com/AliceO2Group/AliceO2/pull/14651
      • Should get Gabriele's new parameters merged for Pb-Pb.
        • Merged; we definitely needed that speedup. Gabriele can report next week.
      • Investigating deficiencies with TPC cluster removal.
        • With the new tracking improvements, we are rejecting more clusters (reducing fakes), and such rejected clusters were not protected from removal. This led to some degradation in the async reco from CTF, since apparently such clusters were needed for the seeding. Fixed in latest O2.
        • In sync reco, we were not using the relaxed cuts to find tracks. Thus we found fewer tracks in async, and some tracks were not protected. That is the main reason we lose tracks in reco from CTF compared to reco from Raw.
        • Immediate measure is to use relaxed cuts in sync reco, which will increase processing time by ~20%. Doing some tests with Ernst. Still need to decide on a final approach to cluster rejection.
        • Added more settings to O2 to steer cluster rejection, and run it in less aggressive mode.
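
      As a reference for the alignas() approach above, a generic illustration of the mechanism; the label layout here is a made-up placeholder, not the actual Monte Carlo label class from O2-5314:

      #include <cstdint>

      struct MCLabelPacked {             // naturally 1-byte aligned: fine on the
        std::uint8_t event, source, id;  // host, but word-wise GPU access needs
      };                                 // stronger alignment guarantees

      struct alignas(4) MCLabel {        // force 4-byte alignment; the struct is
        std::uint8_t event, source, id;  // padded from 3 to 4 bytes
      };

      static_assert(alignof(MCLabelPacked) == 1, "packed layout");
      static_assert(alignof(MCLabel) == 4, "safe for 32-bit word access");
      static_assert(sizeof(MCLabel) == 4, "size rounded up to the alignment");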

       

      Other topics:

      • Assembled the development server yesterday. Unfortunately, got only 32 GB DIMMs, and we have only 4 slots, so we'll have 128 GB, not 256. But that should be OK. (We cannot return these modules and replace them with 64 GB modules.)
        • Mass storage disks were ordered as SAS drives, checking if we can switch to SATA, otherwise will add a cheap SAS controller.
        • Otherwise, server seems fully working. Felix can take care of installation once he is back after his PhD defense.
      • GRID memory monitoring: if PSS is monitored, GPU memory is shown as host memory, thus our GPU jobs are reported as using excessive host memory.

       

      EPN GPU Topics:

       

    • 10:20 → 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      CPU benchmarking

      • --resources-monitoring 2
      • Pb-Pb, 1 simulation: 25 events
        • 10x processing per algorithmic setting
        • tpc-tracker
      Setting                                    CPU wall time mean [s]   std [s]
      default                                    144.6415                 0.8035
      Class: 399 (FC), 0.03;  Reg. 399 (FC)      148.6452                 5.8687
      Class: 399 (FC), 0.1;   Reg. 399 (FC)      147.1118                 8.466
      Class: 399 (FC), 0.03;  Reg. 399 (CNN)     154.8815                 8.3
      Class: 599 (FC), 0.03;  Reg. 599 (FC)      153.6998                 6.6834

       

      • More fluctuation when NN is used
      • For realistic case (399, 0.1): CPU wall time increases by ~2%, but fully compatible within uncertainty

      • Memory profiling (example)

       

      Momentum vector estimation

       

    • 10:25 → 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      ITS seeding vertexing news

      Motivation

      • Algorithm that attempts to find an approximation of vertices before tracking
      • Used by ITS tracking to reduce combinatorics (thus ITS seeding vertexer --> ITS tracking)
      • Basically tries to roughly find vertices without using full tracking

      Current state

      The algorithm uses the first three layers of the ITS; simplified explanation (a schematic sketch follows the list):

      1. Compute tracklets made of two clusters on consecutive layers (0-1 and 1-2)
      2. Validate tracklets between layers
      3. Extend tracklets with a straight line
      4. For each line i
        1. For each line j = i+1
          1. If the line is already used, skip
          2. If the distance of closest approach (DCA) < cut, create a vertex, mark the lines as used, then:
            1. For each line k
              1. If the line is used, skip
              2. If DCA < cut, add it to the current vertex and mark it as used
      5. Sort all vertices by number of contributors
      6. For each "vertex cluster" k
        1. For each "vertex cluster" m = k+1
          1. If distance < cut, merge them
      7. Sort all vertices by number of contributors
      8. For each cluster k
        1. Promote the biggest one that passes some cuts as primary vertex
        2. Promote the others as vertices if they have low multiplicity and are close to the beam line
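
      A schematic C++ sketch of the serial step 4 above, to make the sequential dependencies explicit; the Line and Vertex types, the dca() helper, and the single cut value are illustrative placeholders, not the actual O2 ITS vertexer interfaces:

      #include <cmath>
      #include <cstddef>
      #include <vector>

      struct Line {
        float p[3]; // a point on the line
        float d[3]; // unit direction
      };

      struct Vertex {
        std::vector<int> contributors; // indices of the contributing lines
      };

      // Distance of closest approach between two infinite 3D lines.
      static float dca(const Line& a, const Line& b)
      {
        float n[3] = {a.d[1] * b.d[2] - a.d[2] * b.d[1],
                      a.d[2] * b.d[0] - a.d[0] * b.d[2],
                      a.d[0] * b.d[1] - a.d[1] * b.d[0]};
        float w[3] = {b.p[0] - a.p[0], b.p[1] - a.p[1], b.p[2] - a.p[2]};
        float nn = std::sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
        if (nn < 1e-6f) { // near-parallel: distance of b's point to line a
          float c[3] = {w[1] * a.d[2] - w[2] * a.d[1],
                        w[2] * a.d[0] - w[0] * a.d[2],
                        w[0] * a.d[1] - w[1] * a.d[0]};
          return std::sqrt(c[0] * c[0] + c[1] * c[1] + c[2] * c[2]);
        }
        return std::fabs(w[0] * n[0] + w[1] * n[1] + w[2] * n[2]) / nn;
      }

      // Step 4: the shared 'used' flags are the sequential dependency, and the
      // outcome depends on the iteration order over (i, j).
      std::vector<Vertex> findCandidates(const std::vector<Line>& lines, float cut)
      {
        std::vector<Vertex> vertices;
        std::vector<bool> used(lines.size(), false);
        for (std::size_t i = 0; i < lines.size(); ++i) {
          for (std::size_t j = i + 1; j < lines.size(); ++j) {
            if (used[i] || used[j]) {
              continue; // line already assigned to a vertex
            }
            if (dca(lines[i], lines[j]) < cut) {
              Vertex v;
              v.contributors = {int(i), int(j)};
              used[i] = used[j] = true;
              // Attach further lines (schematic: the real code cuts on the
              // DCA to the vertex candidate, not to line i).
              for (std::size_t k = 0; k < lines.size(); ++k) {
                if (!used[k] && dca(lines[k], lines[i]) < cut) {
                  v.contributors.push_back(int(k));
                  used[k] = true;
                }
              }
              vertices.push_back(std::move(v));
            }
          }
        }
        return vertices;
      }

      Marking lines as "used" as soon as they are attached is exactly what makes the result depend on the evaluation order, as noted below.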

       

      • Steps 1 and 2 (tracklet creation and tracklet matching) are already parallelized on CPU via TBB and ported to GPU by Felix S.
      • Rest of the vertexing is purely serial (many sequential dependencies)
      • Cannot directly parallelize
      • Result dependent on order of evaluation of the lines
        • Might miss some better associations because the lines had already been "used"
      • This algorithm should be the last step for bringing all ITS tracking to GPUs

      What to do

      • Talked also with Matteo C.
      • He tried to implement a histogram-based algorithm; he told me that it was not ideal due to too many assumptions
      • There must be other ways to do this step

      My idea

      • Use this vertexing algorithm: D. Jackson, "A topological vertex reconstruction algorithm for hadronic jets", Nucl. Instrum. Meth. A 388 (1997) 247-253, doi:10.1016/S0168-9002(97)00341-0.
      • Basically, for each track i a "Gaussian tube" is computed:
        f_i(r) = exp( -1/2 (r - p_i)^T V_i^{-1} (r - p_i) )
        where r is a point in 3D space, p_i is the point of closest approach of track i to point r, and V_i is a covariance matrix to adjust the shape of the tube
      • The closer the track is to a point, the greater the value (Gaussian-shaped function)
      • To find the vertices, compute the vertex function
        V(r) = sum_i f_i(r) - ( sum_i f_i(r)^2 ) / ( sum_i f_i(r) )
      • High peaks in the function automatically indicate vertex candidates
      • The second term suppresses contributions coming from only a single line
      • Tested this function with some pp TFs. Below: projection onto the transverse plane, integrating over the beam line:
      • Clearly, most of the non-zero regions of the function peak around the beam line (0,0)
      • Computed the vertex function exactly at the points where the old vertexer was finding vertices, showing that the function indeed signals the presence of vertices (example for a ROF where 6 vertices were found):
      • Meanwhile, it can also signal secondary vertices (different ROF than in the previous plots):

      The algorithm in a nutshell

      • With this function, one needs to find the peaks in 3D space and cluster them so as to identify vertices
      • High-multiplicity vertices are automatically signaled by high peaks
      • By tuning the covariance matrix, the shape of the tube can be optimized, and thus the shape of the vertex function (more or less sensitive to noise)
      • Since the function is > 0 only where two or more lines pass close to each other, it is not necessary to scan the whole 3D volume
        • Just compute the position of each vertex candidate made from a pair of lines and cluster the close candidates
      • Algorithm parallelizable over pairs of lines (see the sketch after this list)
      • Every thread computes the function for a pair of lines --> high compute load (good for GPU, let's see for CPU)
        • Or even: every block takes a candidate and all threads compute the vertex function at that candidate point
      • In this way every pair gets a "chance to shine"
      • Merging candidates that are close still has to be thought through
      • Talked also with Ruben to understand how global vertexing works --> uses a density-based scan (DBSCAN) for clustering
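
      A minimal C++ sketch of the pair-parallel idea, assuming an isotropic Gaussian tube of width sigma instead of the full covariance matrix; the names (Line, pairCandidate, scanPairs) and the threshold are illustrative, not existing O2 code. Every (i, j) pair is independent, so the outer loop maps directly to one GPU thread (or work item) per pair:

      #include <array>
      #include <cmath>
      #include <cstddef>
      #include <vector>

      struct Line {
        std::array<float, 3> p; // a point on the line
        std::array<float, 3> d; // unit direction
      };

      // Squared distance from point r to line l (via the point of closest approach).
      static float distSq(const Line& l, const std::array<float, 3>& r)
      {
        float t = 0.f;
        for (int a = 0; a < 3; ++a) {
          t += (r[a] - l.p[a]) * l.d[a];
        }
        float dsq = 0.f;
        for (int a = 0; a < 3; ++a) {
          float c = r[a] - l.p[a] - t * l.d[a];
          dsq += c * c;
        }
        return dsq;
      }

      // V(r) = sum_i f_i(r) - (sum_i f_i(r)^2) / (sum_i f_i(r)), with the
      // isotropic tube f_i(r) = exp(-d_i^2 / (2 sigma^2)).
      static float vertexFunction(const std::vector<Line>& ls, const std::array<float, 3>& r, float sigma)
      {
        float s = 0.f, s2 = 0.f;
        for (const auto& l : ls) {
          float f = std::exp(-distSq(l, r) / (2.f * sigma * sigma));
          s += f;
          s2 += f * f;
        }
        return s > 0.f ? s - s2 / s : 0.f;
      }

      // Candidate position for a pair: midpoint of the closest-approach segment.
      static std::array<float, 3> pairCandidate(const Line& a, const Line& b)
      {
        float B = 0.f, D = 0.f, E = 0.f;
        for (int k = 0; k < 3; ++k) {
          B += a.d[k] * b.d[k];
          D += a.d[k] * (a.p[k] - b.p[k]);
          E += b.d[k] * (a.p[k] - b.p[k]);
        }
        float den = 1.f - B * B; // unit directions, so A = C = 1
        float s = den > 1e-6f ? (B * E - D) / den : 0.f;
        float t = den > 1e-6f ? (E - B * D) / den : 0.f;
        std::array<float, 3> m;
        for (int k = 0; k < 3; ++k) {
          m[k] = 0.5f * (a.p[k] + s * a.d[k] + b.p[k] + t * b.d[k]);
        }
        return m;
      }

      // Embarrassingly parallel over pairs; candidates above threshold would then
      // be merged by a DBSCAN-like clustering step.
      std::vector<std::array<float, 4>> scanPairs(const std::vector<Line>& ls, float sigma, float thr)
      {
        std::vector<std::array<float, 4>> cands; // x, y, z, V(r)
        for (std::size_t i = 0; i < ls.size(); ++i) {
          for (std::size_t j = i + 1; j < ls.size(); ++j) {
            auto r = pairCandidate(ls[i], ls[j]);
            float v = vertexFunction(ls, r, sigma);
            if (v > thr) {
              cands.push_back({r[0], r[1], r[2], v});
            }
          }
        }
        return cands;
      }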

       

    • 10:30 → 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Workshop November 19 - 21
        • Need to prepare talk about Task 1.7.
        • Collected summaries from people working on Task 1.7 last week
      • Workshop November 6
      • Discussing implementation of our template SoA code in MadGraph with Stefan Roiser
        • Had another discussion with Daniele Massaro about this 
        • Next Steps:
          • Help Daniele Massaro by turning the first class into SoA (beginning of November)

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo
      • Simplified the AoS-SoA wrapper code (see the schematic sketch at the end of this section)
      • Started to apply our AoS-SoA code to: 
        • Merger
          • GPUTPCGMSectorTrack
          • GPUTPCGMTrackParam 
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTrack
      • Started with SectorTracker
        • Get same results (same GPU.out files)
        • Execution time was the same =(
          • Changed more classes to SoA
          • Running with large event set (thanks to Gabriele)
          • Performance still same (or maybe 2% slower)
      • Found a solution for
        • Explicit instantiations
        • GPURestrict()
      • Fixed CI pipelines: they now fail if O2 standalone yields different results
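
      For reference, a minimal sketch of the AoS-to-SoA wrapper idea: one contiguous array per member behind an AoS-style proxy, so kernels keep their familiar syntax. The names (PointSoA, Ref, shift) are illustrative, not the actual NGT/O2 wrapper code:

      #include <cstddef>
      #include <vector>

      struct PointSoA {
        std::vector<float> x, y; // one contiguous array per member

        explicit PointSoA(std::size_t n) : x(n), y(n) {}

        // Lightweight proxy returned by operator[]; behaves like a reference
        // to an AoS element while reading/writing the SoA arrays.
        struct Ref {
          PointSoA& s;
          std::size_t i;
          float& x() { return s.x[i]; }
          float& y() { return s.y[i]; }
        };
        Ref operator[](std::size_t i) { return {*this, i}; }
      };

      // Kernels written against the proxy look like their AoS counterparts, but
      // consecutive iterations (or GPU threads) touch consecutive memory per member.
      void shift(PointSoA& pts, float dx)
      {
        for (std::size_t i = 0; i < pts.x.size(); ++i) {
          pts[i].x() += dx; // contiguous accesses over pts.x
        }
      }
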
    • 10:35 → 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 → 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))
    • 10:45 → 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      no news

    • 10:50 → 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbär (CERN)