Alice Weekly Meeting: Software for Hardware Accelerators

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:30 – 10:50
      Discussion 20m
      Speaker: David Rohr (CERN)

      Color code: (critical, news from this week: blue, news from last week: purple, no news: black)

      Sync reconstruction

       

      Async reconstruction

      • Need to investigate short GPU stall problem.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on the EPNs). Enabled multi-threading as recommended by Matthias; need to check whether it works.
      • Test with GPU GRID jobs at NERSC pending.
      • Will tune the existing 16-core settings, and add a SITEARCH for 16-core CPU and for 16-core CPU + generic NVIDIA / AMD GPU, as for 8 cores.
      • Will retune EPN async workflow for TPC + ITS on GPU on 2025 data.

       

      GPU ROCm / compiler topics:

      • Problem with building ONNXRuntime with MigraphX support.
      • Need to find a way to build ONNXRuntime with support for CUDA and for ROCm.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Miscompilation / internal compiler error fixed in new clang for ROCm 7.x; SDMA engine synchronization bug still not fixed.
        • Serialization bug pending.
        • Miscompilation on MI100 leading to a memory error pending.
        • New miscompilation on MI50 with ROCm 7.0 when RTC is disabled.
        • New miscompilation on MI50 on ROCm 6.3 and 7.0 when RTC is enabled, with the latest software. Have a workaround for Pb-Pb data taking, but it is not compatible with the latest tracking developments.
        • Waiting for ROCm 7.2, which could fix the MI100 serialization issue for good. Not clear yet with regard to the miscompilation problems.
      • Dev2 server available with MI210 and ROCm 7.2.
        • Runs stably on MI50 and MI210 GPUs; didn't check correctness yet.
        • Serialization issue on MI100 seems fixed, or at least no longer triggers immediately.
        • Instead, now seeing other random crashes on MI100.
      • Seeing a crash on an NVIDIA RTX 6000 GPU with some data sets; must be a bug either in our code or on the NVIDIA side.

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix the alignment of Monte Carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Final solution: merging transformation maps on the fly into a single flat object:
        • Sergey is checking. Difference in treatment outside of the measured region. New treatment implemented in the old maps, to check whether that causes the differences.
      • Need to check the problem with ONNX external memory allocator.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows. PR: https://github.com/AliceO2Group/AliceO2/pull/14542

       

      Other topics:

      • Molex connectors and cables for GPU CI server arrived, need to assemble them.
      • Removed bogus setting of rocm path in LD_LIBRARY_PATH by ONNXRuntime recipe. Seems to have no side effects.
      • Build failures without a clear error when using old aliBuild versions. Added a feature to aliBuild to query its version; will disable all GPU builds for old versions in the future.

       

      EPN GPU Topics:

       

    • 10:50 – 10:55
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Cluster error parameterization

       

      Current approach:

      • O2/GPU/GPUTracking/Merger/GPUTPCGMTrackParam.cxx:
        prop.GetErr2(err2Y, err2Z, param, zz, cluster.row, clusterState, cluster.sector, time, invAvgCharge, invCharge);
        #ifndef GPUCA_GPUCODE
        fprintf(fpdumperr, "%d,%d,%f,%f,%d,%f,%f,%f,%f,%f,%f,%f,%f,%f,%f,%f,%f,%f\n", iTrk, cluster.num, err2Y, err2Z, clusterState, xx, yy, zz, mP[0], mP[1], mP[2], mP[3], mP[4], mC[0], mC[2], mC[5], mC[9], mC[14]);
        #endif
      • For every dumped cluster: parse the cluster.num column and keep only the latest entry per cluster
      • Training the NN:
        • labels_x = ["clusterState", "xx", "yy", "zz", "mP[2]", "mP[3]", "mP[4]", "mC[0]", "mC[2]", "mC[5]", "mC[9]", "mC[14]"]
          labels_y = ["yy", "zz", "mP[0]", "mP[1]"]
        • data_Y[0] = data_Y["yy"]**2 - data_Y["mP[0]"]**2
        • data_Y[1] = data_Y["zz"]**2 - data_Y["mP[1]"]**2
      • Normalized mC[0] and mC[2] using log10() because their values were O(10^4) to O(10^7)
      • Training pipeline is set up but training doesn't converge well yet...

       

       

      Checking the distributions

       

    • 10:55 – 11:00
      GPU Parameter Optimizations 5m
      Speaker: Gabriele Cimador (CERN, Università and INFN Torino)

      News from GPU parameter tuning

      • Set up tuning for RTX 6000
      • Crash with lhc24ar still present
      • Currently tuning with simulated PbPb

      New GPU parametrization

      • PR online 🎉
    • 11:00 – 11:05
      Efficient Data Structures 5m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      NextGenTrigger Task 1.7

      • Implementing our SoA code in MadGraph, together with
        • Stefan Roiser
        • Daniele Massaro
      • No news

      Implement NGT SoA Code in O2 standalone benchmark

      • Working on this fork of the AliceO2 repo, with a CI pipeline:
        • Running on NGT hardware with 4 different GPUs (Nvidia and AMD)
        • Extended the CI pipeline to fail if GPU.out changes
      • Implemented SoA in:
        • SectorTracker
          • GPUTPCBaseTrackParam
          • GPUTPCTrackParam
          • GPUTPCTracklet
      • Maybe go back from CRTP to classical inheritance
      • New way of providing iterators:
      • Next Steps:
        • Make better use of SoA to improve performance
        • Try David's suggestion
    • 11:05 – 11:10
      Following up GPU to-dos 5m
      Speaker: Dr Vikas Singhal (Department of Atomic Energy (IN))
    • 11:10 – 11:15
      TPC Clusterization / OpenCL / Highly Ionizing Particles 5m
      Speaker: Felix Weiglhofer (CERN)

      OpenCL

      No news.

      GPU Servers

      Dev machine: integrated into the staging slurm (epn-infra13), in a separate partition from epn-100.

      Highly Ionizing Particles

      Tail masking almost done. 

    • 11:15 – 11:20
      ITS Tracking 5m
      Speakers: Felix Schlepper (CERN, Heidelberg University (DE)), Gabriele Cimador (CERN, Università and INFN Torino), Matteo Concas (CERN)

      Felix: Last PR contained some bug; will probably revert for now, as tests crashed on all jobs on the EPNs. Hard to reproduce, though: only a single sporadic invalid read when processing 300 TFs (with compute-sanitizer memcheck/racecheck). Have to think about a better strategy.

      Gabriele: Fixed bug in trackleting (or at least tracklets are now found); will resume work on ITS vertexing

    • 11:20 – 11:25
      System Run Coordination Topics 5m
      Speaker: Ernst Hellbar (CERN)