ALICE Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
Useful links
Join via phone
Zoom URL
    • 10:00 AM 10:20 AM
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      CHEP Abstracts: https://indico.cern.ch/event/1471803/abstracts/ Deadline Dec. 19.

      • Please upload to https://docs.google.com/document/d/1eek6kv_SqHE6b5k0KHs-6wcjHIxCKyQr6eldnhsptvY/edit?tab=t.0

       

      Sync reconstruction

      • Crash in TPC ZS decoding when receiving bad data; should check and improve the protection such that we do not crash.
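A minimal sketch of the kind of protection meant here (the header layout and names are purely illustrative, not the real TPC ZS format): validate the page header against the actual payload size and reject the page instead of reading out of bounds.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative stand-in for a zero-suppressed raw page header.
struct ZSPageHeader {
  uint32_t nADCsamples; // payload size claimed by the header
  uint32_t timeBin;
};

// Returns false (page skipped) instead of crashing on corrupt input.
bool decodePage(const std::vector<uint8_t>& page, std::vector<uint16_t>& out) {
  if (page.size() < sizeof(ZSPageHeader)) {
    return false; // truncated page: reject
  }
  ZSPageHeader hdr;
  std::memcpy(&hdr, page.data(), sizeof(hdr));
  const size_t payloadBytes = page.size() - sizeof(hdr);
  // The header must not claim more samples than the payload can hold.
  if (hdr.nADCsamples > payloadBytes / sizeof(uint16_t)) {
    return false; // inconsistent header: reject
  }
  out.resize(hdr.nADCsamples);
  std::memcpy(out.data(), page.data() + sizeof(hdr),
              hdr.nADCsamples * sizeof(uint16_t));
  return true;
}
```

The real decoder runs on the GPU and cannot simply return, but the same principle applies: every size read from the data must be cross-checked before it is used as a loop bound or offset.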

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.
      • New GPU architecture selection for async in O2DPG looks good, should be merged.
      • Test with GPU GRID jobs at NERSC pending.
      • Asked DPG to run first test with ITS tracking on GPU on EPNs.

       

      GPU ROCm / compiler topics:

      • Issues that disappeared but are not yet understood: random server reboot with Alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when the DMA engine is turned off, MI100 stalling with ROCm 5.5.
      • Problem with building ONNXRuntime with MIGraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in the new clang for ROCm 7.x; the SDMA engine synchronization bug is still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • AMD is changing their support structure: we shall file reports via GitHub (which I like, since it is also better traceable). However, they will no longer assign an engineer to follow up on all our issues; everything has to go through their normal support process. In particular, this means their compiler team might not be able to fix issues if we do not provide a minimal reproducer.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Sergey opened a new PR with the fixes and compatibility layer in, currently fails in the CI. Must be fixed, then Matthias can continue commissioning.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542
      • New safer, dynamic cluster protection working and deployed at P2.
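As a sketch of the alignas() approach mentioned for the Monte Carlo labels (the struct below is an illustrative stand-in, not the real O2 label class): forcing a known alignment and size on the element type makes reinterpreting label buffers safe, and static_asserts catch any silent padding change.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: a label element with an explicitly enforced alignment,
// so buffers of labels can be safely reinterpreted on host and device.
struct alignas(8) MCLabel {
  int32_t trackID;
  int16_t eventID;
  int16_t sourceID;
};

// Compile-time guarantees instead of relying on implicit layout rules.
static_assert(alignof(MCLabel) == 8, "labels must be 8-byte aligned");
static_assert(sizeof(MCLabel) == 8, "no padding surprises");
```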

       

      Other topics:

      • GRID memory monitoring: Discussed with Maksim; the problem with incorrect values from smaps is already fixed, since they switched to cgroup monitoring.

       

      EPN GPU Topics:

       

    • 10:20 AM 10:25 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      CCDB fetching

      • PR pending: https://github.com/AliceO2Group/AliceO2/pull/14841 (MacOS CI problem unrelated)
      • Works in full system test
        • What to do about: *mConfParam = mConfig->ReadConfigurableParam()
        • What to do about metadata -> Would be good to have in order to separate networks (e.g. with different inputs)
      • Uploaded right now: Networks from the commissioning runs. Do we need more?
      • Upload sheet: https://docs.google.com/spreadsheets/d/1BGgDFqKnvYLlCK05hn5paQsDaiiE5HrwCErekvDqTv4/edit?usp=sharing

       

      pp simulation

      • 3000 min bias events, LHC24af, 1 MHz
      • Evaluation with PbPb networks

       

      Almost no occupancy coverage with this data. No advantage to be gained on the cluster properties (regression).

      Some improvement for the qTot estimation for wide clusters (qTot / qMax large):

      Still, clusters are being rejected with higher thresholds

      Efficiency increases in the region where the highest cluster rejection occurs. No fake-rate improvement: the fake rate is already extremely low!

       

      Tracks are so well separated that there is no real improvement to be found

       

      Cluster error / split clusters

      10 EV, 38 kHz PbPb, 0-5% centrality enforced

      Investigation of split clusters with the NN

      • Option 1: Use all MC charges, search for maxima that have no assigned ideal label
        • Problem: This can find maxima per MC label which might not correspond to location of maxima in digits
      • Option 2: Check training data: Find all training data inputs that have class label 0 (no attached ideal cluster) but exactly one peak in the 5x5 neighbourhood with assignment (red), or multiple peaks each with assignment (blue). If the network rejects such maxima, this will correlate to a reduction in split clusters.

       

      Examples:

      • For all clusters with class label 0 (regardless of neighbouring peaks) at a 0.1 threshold: 44.7% rejection. This includes looper clusters and noise peaks -> only a small drop in efficiency for split clusters: they are similarly well identifiable.
    • 10:25 AM 10:30 AM
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      • Started working on migrating parameter definitions from a plain header to CSV, nothing to report yet
      • Should I start gathering statistics / create a report on the performance of the GPUs we have access to?
        • EPN GPUs
        • NGT GPUs
        • CI server
        • Dev server
    • 10:30 AM 10:35 AM
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended CI pipeline to fail if GPU.out changes
      • Changed the SoA / AoS code to better fit ALICE O2
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
      • Performance still the same (or maybe 2% slower)
      • Next Steps:
        • Check if AoS has no overhead due to the new abstraction
        • Make better use of SoA to improve performance
    • 10:35 AM 10:40 AM
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))

      To Do:

       

      Class: Base/GPUReconstructionProcessing.h
      TODO/FIXME: namespace gpu_reconstruction_kernels // TODO: Get rid of this namespace
      Implementation/Understanding: Do I need to delete the entire namespace, including the struct deviceEvent and the threadContext class? Case 1: removing only the namespace; Case 2: removing the entire namespace. In both cases I was getting errors during make. What is the logic behind this TODO?

      Remarks: I would like to remove the namespace completely. The point is: right now GPUReconstruction.h only forward-declares deviceEvent and threadContext, and I want to keep it like that. Perhaps you can change it to something like https://godbolt.org/z/K6h53TnPa, then we can get rid of the namespace.
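The linked godbolt example is not reproduced in the minutes; as an illustration of the direction, one way to drop the namespace while keeping GPUReconstruction.h on forward declarations only is to make deviceEvent and threadContext plain top-level classes (all definitions below are simplified stand-ins, not the real O2 code).

```cpp
#include <cassert>

// --- GPUReconstruction.h (consumer): forward declarations only ---
// Pointers and references work with incomplete types, so the consumer
// header never needs the full definitions.
class deviceEvent;
class threadContext;

class GPUReconstruction {
 public:
  threadContext* ctx = nullptr;
};

// --- GPUReconstructionProcessing.h: full definitions, no extra namespace ---
class deviceEvent {
 public:
  bool recorded = false;
};

class threadContext {
 public:
  int id = 0;
};
```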
      Class: DataTypes/GPUTPCGMMergedTrackHit.h
      TODO/FIXME: // TODO: take them directly from clusterNative header
      Implementation/Understanding: Why are we trying to take the states directly from ClusterNative? Not all of the flag states are declared in ClusterNative.h, thus we cannot replace all of them with those from ClusterNative. I tried to remove this one by one, but I got make errors.

      Remarks: I would like to avoid the copy and paste. Perhaps you can include clusterNative.h, and then reuse the defines from there via

      flagSplitPad = ClusterNative::flagSplitPad;
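As a hedged sketch of this suggestion (the ClusterNative stand-in below is simplified, not the real O2 header), the duplicated flag values can be aliased from ClusterNative, while flags that only exist on the merged-hit side stay defined locally:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for the ClusterNative flag definitions.
struct ClusterNative {
  static constexpr uint8_t flagSplitPad = 0x1;
  static constexpr uint8_t flagSplitTime = 0x2;
};

// Merged-hit side: alias the shared flags instead of copy-pasting values,
// so the two headers cannot drift apart.
struct GPUTPCGMMergedTrackHit {
  enum hitFlags : uint8_t {
    flagSplitPad = ClusterNative::flagSplitPad,
    flagSplitTime = ClusterNative::flagSplitTime,
    // Flags that do not exist in ClusterNative remain local (value illustrative):
    flagRejectDistance = 0x10,
  };
};
```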

      Class: TPCClusterFinder/GPUTPCCFDecodeZS.cxx
      TODO/FIXME: for (int32_t l = 0; l < hdr->nTimeBinSpan; l++) { // TODO: Parallelize over time bins

      Compiled without problem. Performance (timing for the particular kernel) needs to be checked.

      Remarks: This TODO is not needed. Everything now runs on the GPU, so an OMP parallel for is not meaningful.
    • 10:40 AM 10:45 AM
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (Goethe University Frankfurt (DE))

      Highly Ionizing Particles

      • Rewrote Pad Filter kernel for preparation.
      • Use one block per TPC row, use tiling to cache in shared memory (on GPU)
      • CPU performance is comparable
      • GPU performance improved (2x), but with slightly different results
        • -> Result differences fixed by adding the missing boundary checks
        • But now GPU performance degrades by several factors... 

      GPU Servers

      Waiting for last parts.

      OpenCL

      No news.

      Other

    • 10:45 AM 10:50 AM
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      ITS seeding vertexing news

      Motivation

      • Algorithm that attempts to find an approximation of vertices before tracking
      • Used by ITS tracking to reduce combinatorics (thus ITS seeding vertexer --> ITS tracking)
      • Basically tries to roughly find vertices without using full tracking

      Current state

      Algorithm uses first three layers of ITS, simplified explanation:

      1. Compute tracklets made of two clusters on consecutive layers (0-1 and 1-2)
      2. Validate tracklets between layers
      3. Extend tracklets with a straight line
      4. For each line i
        1. For each line j=i+1
          1. If line already used, skip 
          2. If Distance of Closest Approach (DCA) < cut, create a vertex, mark line as used
            1. For each line k
            2. If line is used, skip
            3. If DCA < cut, add to current vertex and mark line as used
      5. Sort all vertices by number of contributors
      6. For each "vertex cluster" k
        1. For each "vertex cluster" m=k+1
          1. if distance < cut, merge them
      7. Sort all vertices by number of contributors
      8. For each cluster k
        1. Promote the biggest one that passes some cuts as primary vertex
        2. Promote the others as vertices if they have low multiplicity and are close to the beam line
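The greedy pairing in steps 4-8 can be sketched as follows (a simplified stand-alone version with illustrative names and cut values, not the actual O2 implementation). It also makes the order dependence visible: a line consumed by an early pair is unavailable for later, possibly better, associations.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a[0]-b[0], a[1]-b[1], a[2]-b[2]}; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
  return {a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
}
static double dot(const Vec3& a, const Vec3& b) { return a[0]*b[0]+a[1]*b[1]+a[2]*b[2]; }
static double norm(const Vec3& a) { return std::sqrt(dot(a, a)); }

struct Line { Vec3 origin, dir; };          // dir assumed normalized
struct Vertex { Vec3 pos; int nContributors; };

// Distance of closest approach between two (possibly skew) 3D lines.
double lineDCA(const Line& a, const Line& b) {
  Vec3 n = cross(a.dir, b.dir);
  Vec3 d = sub(b.origin, a.origin);
  double nn = norm(n);
  if (nn < 1e-12) {                         // parallel lines
    return norm(cross(d, a.dir));
  }
  return std::fabs(dot(d, n)) / nn;
}

// Greedy core of steps 4-8: pair unused lines below the DCA cut, then
// attach further unused lines to the same vertex.
std::vector<Vertex> greedyVertices(const std::vector<Line>& lines, double cut) {
  std::vector<bool> used(lines.size(), false);
  std::vector<Vertex> vertices;
  for (size_t i = 0; i < lines.size(); ++i) {
    if (used[i]) continue;
    for (size_t j = i + 1; j < lines.size(); ++j) {
      if (used[j] || lineDCA(lines[i], lines[j]) >= cut) continue;
      used[i] = used[j] = true;
      Vertex v{lines[i].origin, 2};         // placeholder position; the real
                                            // code would use the DCA midpoint
      for (size_t k = j + 1; k < lines.size(); ++k) {
        if (!used[k] && lineDCA(lines[i], lines[k]) < cut) {
          used[k] = true;
          ++v.nContributors;
        }
      }
      vertices.push_back(v);
      break;                                // line i is consumed
    }
  }
  return vertices;
}
```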

       

      • Step 1 and 2 (tracklet creation and tracklet matching) already parallelized on CPU via TBB and ported to GPU by Felix S.
      • Rest of the vertexing is purely serial (many sequential dependencies)
      • Cannot directly parallelize
      • Result dependent on order of evaluation of the lines
        • Might miss some better associations because the lines had been already "used"
      • This algorithm should be the last step for bringing all ITS tracking to GPUs

      What to do

      • Talked also with Matteo C.
      • He tried to implement a histogram-based algorithm; he told me it was not ideal due to too many assumptions
      • There must be other ways to do this step

      My idea

      • Use this vertexing algorithm: D. Jackson, "A topological vertex reconstruction algorithm for hadronic jets", Nucl. Instrum. Meth. A 388 (1997) 247-253, doi:10.1016/S0168-9002(97)00341-0.
      • Basically, for each track a "gaussian tube" is computed:
      • Where r is a point in 3D space, p is the point of closest approach of track i to the point r, and V is a covariance matrix to adjust the shape of the tube
      • The closer the track is to a point, the greater the value (gaussian-shaped function)
      • To find the vertices, compute the vertex function:
      • High peaks in the function then automatically indicate vertex candidates
      • The second term suppresses contributions coming from only a single line
      • Tested this function with some pp TFs. Below: projection onto the transverse plane, integrating over the beam line:
      • Clearly, most of the non-zero regions of the function peak around the beam line (0,0)
      • Computed the vertex function exactly at the points where the old vertexer was finding vertices, showing that the function indeed signals the presence of vertices (example for a ROF where 6 vertices were found):
      • Meanwhile, it can also signal secondary vertices (different ROF than in the previous plots):
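The two formulas referenced above appear as images in the original slides and are not in the notes; reconstructed from the cited Jackson (1997) paper, up to notation they should read approximately:

```latex
% Gaussian tube of track i, evaluated at a point r in 3D space
% (p_i: point of closest approach of track i to r,
%  V_i: covariance matrix shaping the tube):
f_i(\mathbf{r}) = \exp\!\left(-\tfrac{1}{2}\,(\mathbf{r}-\mathbf{p}_i)^{T} V_i^{-1} (\mathbf{r}-\mathbf{p}_i)\right)

% Vertex function: the second term suppresses regions where only a
% single track contributes:
V(\mathbf{r}) = \sum_i f_i(\mathbf{r}) \;-\; \frac{\sum_i f_i^{2}(\mathbf{r})}{\sum_i f_i(\mathbf{r})}
```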

      The algorithm in a nutshell

      • With this function, it is necessary to find the peaks in the 3D space and cluster them so as to identify vertices
      • High multiplicity vertices automatically signaled by high peaks
      • By tuning the covariance matrix, the shape of the tube can be optimized, and thus the shape of the vertex function (more or less sensitive to noise)
      • Since the function is > 0 when two or more lines pass closely, it is not necessary to scan the whole 3D volume
        • Just compute the positions of each vertex made from a pair of lines and cluster the close candidates
      • Algorithm parallelizable over pairs of lines
      • Every thread computes the function for a pair of lines --> high compute load (good for GPU, let's see for CPU)
        • Or even every block takes a candidate and all threads compute the vertex function in that candidate point
      • In this way every pair gets a "chance to shine"
      • Merge candidates that are close... still have to think about this
      • Talked also with Ruben to understand how global vertexing works --> Density Based Scan for clustering
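A sketch of the candidate-merging step under discussion (illustrative greedy running-mean merge; a DBSCAN-style clustering, as mentioned for global vertexing, would be a drop-in replacement): every line pair contributes one candidate point independently, and nearby candidates are then collapsed into vertex clusters.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

using Point = std::array<double, 3>;

static double dist(const Point& a, const Point& b) {
  double dx = a[0]-b[0], dy = a[1]-b[1], dz = a[2]-b[2];
  return std::sqrt(dx*dx + dy*dy + dz*dz);
}

struct Cluster { Point center; int n; };

// Greedy merge: attach each candidate to the first cluster within eps,
// updating the cluster center as a running mean; otherwise open a new cluster.
std::vector<Cluster> mergeCandidates(const std::vector<Point>& cands, double eps) {
  std::vector<Cluster> clusters;
  for (const Point& c : cands) {
    bool merged = false;
    for (Cluster& cl : clusters) {
      if (dist(cl.center, c) < eps) {
        for (int d = 0; d < 3; ++d) {
          cl.center[d] = (cl.center[d] * cl.n + c[d]) / (cl.n + 1);
        }
        ++cl.n;
        merged = true;
        break;
      }
    }
    if (!merged) clusters.push_back({c, 1});
  }
  return clusters;
}
```

Since each candidate comes from an independent line pair, the candidate computation itself maps naturally onto one GPU thread per pair; only this final merge needs care.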

       

       

       

       

      AOB

      Asked DPG for another test of the async production, now with memory clearing implemented (thanks David; hopefully this goes well :))

      Rewriting the tracking right now to implement the staggering (no ETA on this; first have to show that it works)

    • 10:50 AM 10:55 AM
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)