ALICE Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 10:20
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune the existing 16-core settings, add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as already done for 8 cores.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x, SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 with ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regards to miscompilation problems.
      • Dev2 server available with MI210 and ROCm 7.2: MI50 and MI210 running but need further validation; MI100 randomly crashing, but with a different problem, not the serialization issue.
      • Seeing a crash on RTX6000 NVIDIA GPU with some data sets, must be either a bug in our code or on the NVIDIA side.
      • New GPU Builder Container with CUDA 13.1.1 available, supporting GCC 15.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Maps now yield correct results, but there is a 1.5x performance regression when running on GPUs; must be investigated.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542

       

      Other topics:

      • Molex connectors and cables for GPU CI server arrived, need to assemble them.
      • All GPU users must update to aliBuild 1.17.40; GPU builds are disabled on older aliBuild versions to avoid random build errors without a proper error message.
      • Major improvement of GPU CMake and parameter loading:
        • Instead of detecting individual architectures, now performing a >= comparison on compute capabilities: if no tuning for the exact architecture is available, take the parameters of the closest previous architecture with tuned values.
        • Support building O2 with database of tuned parameters provided in CSV format.
        • Support to merge on-the-fly multiple database files in CSV and JSON format.
        • Can create binary .par files for loading at RTC from CSV and JSON file with simple script.
        • Sped up GPU CMake from ~2sec to ~0sec.
      • Switched away from hard-exporting ALIBUILD_O2_FORCE_GPU=1 in the GPU builder container; moved it to the env files of CI jobs and to the defaults of Jenkins builders where GPU is needed, so that it can e.g. be disabled for dataflow-defaults CI jobs / Jenkins builders.
        • Everyone using the slc9-gpu-builder container, or Jenkins, to build with GPU: note that to get the old behavior, ALIBUILD_O2_FORCE_GPU=1 must be exported.
        • Note that by default GPUs are autodetected, so all backends are available, but builds target the fallback architectures if no GPU is detected (MI50 for AMD; sm_75-virtual for NVIDIA), not our default list of production architectures (including MI100, RTX Pro 6000, ...).
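      The architecture fallback described above (take the tuned parameters of the closest previous compute capability when the exact one is not in the database) can be sketched as follows. This is an illustrative sketch only; the struct and function names are hypothetical, not the actual O2 CMake / parameter-loading code.

      ```cpp
      #include <map>

      // Hypothetical tuned-parameter record (illustrative fields).
      struct TunedParams {
        int blockSize;
        int gridFactor;
      };

      // Database keyed by compute capability (e.g. 75 for sm_75).
      // Returns the entry for the closest capability <= the requested one,
      // or nullptr if no tuned architecture lies at or below it.
      inline const TunedParams* findTunedParams(const std::map<int, TunedParams>& db,
                                                int computeCapability)
      {
        // First entry strictly above the requested capability, then step back one.
        auto it = db.upper_bound(computeCapability);
        if (it == db.begin()) {
          return nullptr; // nothing tuned at or below this capability
        }
        return &std::prev(it)->second;
      }
      ```

      For example, with tuned entries for capabilities 75 and 86, a query for 89 falls back to the 86 parameters, while a query for 70 yields no match.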

       

      EPN GPU Topics:

       

    • 10:20 10:25
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster error parameterisation

       

      Goal: Train the cluster errors to approximate a Gaussian error around the track tube for any given cluster

      • First idea: cluster error = sqrt((cluster-to-track residual)^2 - track covariance) -> if the track covariance gets larger than the residual, we have a problem (negative argument)
      • Improved idea for loss function
        • Assume Gaussianity of cluster-to-track residuals and cluster errors: maximize the probability of the Gaussian by tuning its standard deviation
        • p(d, S) = 1/sqrt(2\pi S) exp(-d^2/(2S)), with d = cluster-to-track residual and S = track covariance + cluster_error^2
        • Maximizing p(d, S) == minimizing -ln(p(d, S)), i.e. minimize: Loss = 0.5*ln(S) + d^2/(2S), where S = track covariance + NN output
      • Input for NN: 
        ["clusterState", "xx", "yy", "zz", "cluster.getSigmaPad()", "cluster.getSigmaTime()", "mP[2]", "mP[3]", "mP[4]", "mC[0]", "mC[2]", "mC[5]", "mC[9]", "mC[14]"]
        • Change clusterState into individual 0/1 inputs using (clusterState >> 0) & 1, (clusterState >> 1) & 1, ... to avoid large input values and allow better distinction
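      A minimal sketch of the loss and input encoding described above (function names are illustrative, not the actual training code):

      ```cpp
      #include <cmath>
      #include <cstdint>
      #include <vector>

      // Gaussian negative log-likelihood as in the loss above:
      // Loss = 0.5*ln(S) + d^2/(2S), with S = track covariance + (NN cluster error)^2.
      // The constant 0.5*ln(2*pi) is dropped, as it does not affect the minimization.
      inline float clusterErrorLoss(float residual, float trackCov, float nnError)
      {
        const float S = trackCov + nnError * nnError;
        return 0.5f * std::log(S) + residual * residual / (2.0f * S);
      }

      // Decompose the clusterState bit mask into individual 0/1 inputs,
      // (clusterState >> k) & 1, instead of feeding the raw integer to the NN.
      inline std::vector<float> clusterStateBits(uint8_t clusterState, int nBits = 8)
      {
        std::vector<float> bits(nBits);
        for (int k = 0; k < nBits; ++k) {
          bits[k] = static_cast<float>((clusterState >> k) & 1);
        }
        return bits;
      }
      ```

      With a perfect prediction (S matching the residual spread), the loss is minimized; e.g. for d = 0, track covariance 1 and NN error 0, the loss evaluates to 0.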

       

      • NN implemented in reco chain:
        • Simulation with standard reco: 388k tracks
        • Simulation with new NN cluster errors (otherwise identical): 120k tracks
          • Also tried scaling the error but still far off

       

      -> Something's fishy...

      • The problem is clusters for which the learned cluster error is 0
      • A potential cause is the training data selection: https://github.com/ChSonnabend/AliceO2/blob/7208318395ce370d4ba522d05e1d31b00ebe29a3/GPU/GPUTracking/Merger/GPUTPCGMTrackParam.cxx#L305

       

      New implementation

      • Removed ReLU in final output layer -> Removes entries at prediction = 0
      • Top: default reco; bottom: with the new cluster error NN
      • 388k tracks (default) vs. 392k tracks (new NN)

       

    • 10:25 10:30
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      See table 

      In summary: finished simulated Pb-Pb tuning for the RTX6000, started tuning on the dev pdp server

    • 10:30 10:35
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Implementing our SoA code in MadGraph, together with
        • Stefan Roiser
        • Daniele Massaro
      • No news

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
          • GPUTPCTrack
      • Went back from CRTP to classical inheritance
      • Implemented new way of providing iterators in O2: https://godbolt.org/z/haas6YP6c
      • Next Steps:
        • Adapt unit test to the changes
        • Adapt benchmarks to the changes
        • Make better use of SoA to improve performance
        • Try David's suggestion
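      The SoA-with-iterators approach linked above can be sketched with a proxy-based iterator, in this spirit (field and class names are illustrative, not the actual NGT / O2 code):

      ```cpp
      #include <cstddef>
      #include <vector>

      // Minimal SoA container: one contiguous array per field instead of an
      // array of structs, with a lightweight proxy so that range-based for
      // loops still look like iteration over individual tracks.
      struct TrackSoA {
        std::vector<float> x, y; // illustrative fields

        struct Proxy { // view of one logical "element"
          TrackSoA* soa;
          std::size_t i;
          float& x() { return soa->x[i]; }
          float& y() { return soa->y[i]; }
        };

        struct Iterator {
          TrackSoA* soa;
          std::size_t i;
          Proxy operator*() { return {soa, i}; }
          Iterator& operator++() { ++i; return *this; }
          bool operator!=(const Iterator& o) const { return i != o.i; }
        };

        Iterator begin() { return {this, 0}; }
        Iterator end() { return {this, x.size()}; }
      };
      ```

      A loop like `for (auto t : soa) t.x() += 1;` then touches one contiguous field array at a time, which is the memory-layout benefit SoA is after.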
    • 10:35 10:40
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 10:40 10:45
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      pdp-dev00 and epn100 are in the same Slurm partition. Access to both is protected via an e-group, for both ssh and Slurm.

      TODO: Add documentation to PDP docs.

      Highly Ionizing Particles

      Tail masking is almost done; currently testing the implementation.

    • 10:45 10:50
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Felix: no news, besides noticing that in the Pb-Pb production we did not use the mean-vertex as a constraint (added in #15054)

      Gabriele:

      • Tested the vertex function on simulated PbPb data
      • Reminder: the vertex function gives a score for each point in the ITS volume, summing a per-track contribution f_i, where i is the i-th track from the first three inner layers and r - p is the distance of the track to the point
      • Currently tested this strategy: voxelize the 3d volume around the beampipe:
        const int nZ = 2000;
        const int nX = 40;
        const int nY = 40;
        const float xMin = -.3f, xMax = .3f;
        const float yMin = -.3f, yMax = .3f;
        const float zMin = -11, zMax = 11;
      • Compute the vertex function for each point (i.e. sum over all tracks)
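      The computation above (sum a per-track contribution f_i at each voxel) can be sketched as follows. This is an illustrative sketch only: f_i is taken here as a Gaussian of the track-to-point distance, which may differ from the actual functional form used in the study, and all names are hypothetical.

      ```cpp
      #include <array>
      #include <cmath>
      #include <vector>

      // A track approximated as a straight line: a point on it and a
      // normalized direction vector.
      struct Line {
        std::array<float, 3> point, dir;
      };

      // Distance from point p to the line: |(p - point) - ((p - point)·dir) dir|
      inline float distToLine(const Line& t, const std::array<float, 3>& p)
      {
        std::array<float, 3> d{p[0] - t.point[0], p[1] - t.point[1], p[2] - t.point[2]};
        const float proj = d[0] * t.dir[0] + d[1] * t.dir[1] + d[2] * t.dir[2];
        float s = 0;
        for (int k = 0; k < 3; ++k) {
          const float c = d[k] - proj * t.dir[k];
          s += c * c;
        }
        return std::sqrt(s);
      }

      // Vertex function V(p) = sum_i f_i(p); f_i here is illustratively a
      // Gaussian of the track-to-point distance with width sigma (in cm).
      inline float vertexFunction(const std::vector<Line>& tracks,
                                  const std::array<float, 3>& p, float sigma = 0.05f)
      {
        float v = 0;
        for (const auto& t : tracks) {
          const float d = distToLine(t, p);
          v += std::exp(-0.5f * d * d / (sigma * sigma));
        }
        return v;
      }
      ```

      Evaluating this at every voxel of the grid above, each track contributes ~1 near its line and ~0 elsewhere, so points where many tracks cross (the vertex) show a clear peak.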

       

      Simulated PbPb data

      Max over x, y of V(x, y, z) vs. z; the vertical line is the z of the true vertex in Read-Out Frame (ROF) 31:
      Clear peak of the function around the true vertex

      So the function can carry the signal of the presence of a vertex in PbPb conditions. However, for the next ROF, n. 32:

      In this ROF there were no vertices, but the signal of the previous vertex is still present (even if with less intensity). The old seeding vertexer was also finding a vertex there. Is this a known problem?

      Pitfalls

      • Extremely heavy to compute on the CPU:

        Old CPU implementation, 20 threads, vertexing part serial:
        Vertex seeding total elapsed time: 1666.8417 ms for 133 (320 + 85) vertices found in 36/191 ROFs

        Vertex function on CPU with 20 threads, computing the vertex function for a single ROF:
        Celled vertex function time: 1811.853949 ms

      • Result dependent on binning

      Next steps

      • Since the vertex function seems informative, try it on GPU and check the performance
      • If the performance is good, proceed in this way:
      • Reduce the number of computations by evaluating the function only at points where a pair of tracks passes close by
      • Select the tracks with a high contribution
      • Perform a fit of the vertex with those tracks
    • 10:50 10:55
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)