Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Europe/Zurich
Zoom Meeting ID
61230224927
Host
David Rohr
    • 10:00 10:20
      Discussion 20m
      Speakers: David Rohr (CERN), Giulio Eulisse (CERN)

      Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

      High priority Framework issues:

      • Start / Stop / Start: 2 problems on O2 side left:
          • All processes are crashing randomly (usually ~2 out of >10k) when restarting. Stack trace hints to FMQ. https://its.cern.ch/jira/browse/O2-4639
          • TPC ITS matching QC crashing when accessing CCDB objects. Not clear if it is the same problem as above, or a problem in the task itself.
      • Stabilize calibration / fix EoS: New scheme: https://its.cern.ch/jira/browse/O2-4308:
        • O2 with new EndOfStream deployed at P2. Working so far, but not tested in detail. Also spotted one minor problem: apparently, if processes hit the stop-transition timeout during STOP, they segfault during SHUTDOWN.
        • Still need to test that it keeps running the calibration after the data processing timeout. RC will run a set of tests with different timeouts.
        • PR with InfoLogger improvements still WIP.
      • Fix problem with ccdb-populator: no idea yet - since Ole left, someone else will have to take it over.
      • TF-status message (from https://github.com/AliceO2Group/AliceO2/pull/13495) sent by readout-proxy. Status?
      • ONNX update: status?

       

      Sync reconstruction

      • Waiting for RC to test COSMIC replay data set.
      • Waiting for RC to test STOP timeout impact.
      • SW update had issues due to bug in QC, now scheduled for tomorrow.

       

      Async reconstruction

      • Remaining oscillation problem: GPUs sometimes get stalled for a long time, up to 2 minutes. Checking 2 things:
        • Does the situation get better without GPU monitoring? --> Inconclusive
        • We can use increased GPU process priority as a mitigation, but it doesn't fully fix the issue.
      • MI100 GPU stuck problem will only be addressed after AMD has fixed the operation with the latest official ROCm stack.
        • MI100 tests on async looked OK, though it was only ~15 jobs. All of them went through without a long GPU stall. Though there were short GPU stalls, which we also see in sync. To be investigated.
      • Limiting factor for the pp workflow is now the TPC time series, which is too slow and creates backpressure (costs ~20% performance on EPNs). Enabled multi-threading as recommended by Matthias - need to check if it works.

       

      AliECS related topics:

      • Extra env var field still not multi-line by default; created https://its.cern.ch/jira/browse/OGUI-1624 to follow this up separately from other tickets.

       

      GPU ROCm / compiler topics:

      • List of important issues with AMD:
        • Issues that disappeared but not yet understood: random server reboot with alma 9.4, miscompilation with ROCm 6.2, GPU getting stuck when DMA engine turned off, MI100 stalling with ROCm 5.5.
          • Leaving this here in case at some point it reappears, or we want to try to understand the root cause.
        • Random crashes on MI100 due to memory error, can be worked around by serializing all kernels and DMA transfers, which causes a ~20% performance degradation.
          • EPN will deploy the fix by AMD with the SW update on Thursday. Then we can remove the automatic workaround. Will only be merged into a later official ROCm release.
        • Discussed with AMD the meaning of a couple of workarounds we are applying. Once we are stable, we should remove some and see if they are still needed.
          • HIP_DEVICE_MALLOC workaround could be removed, not needed any more. NO_SCRATCH_RECLAIM workaround also removed; with the new ROCm it was even degrading performance. NO_LOWER_LDS compiler option must remain: it helps with amdgpu-function-calls and improves performance.
      • SLC9 GPU container in Jenkins working and RPM publishing fixed. Set up as secondary container for online builds (since we still need the old container for the infra node) and switched to a unified SLC9 container for offline builds (can run on the GRID and on the EPNs with GPU).
      • Created GenTopo (topology generation) build for EL9, to be used by EPN once they move their infra nodes to EL9.
      • Try to find a better solution for the problem with __device__ inline functions leaking symbols in the host code.
      • Improved GPU standalone build, so we can build with sanitizers in optimized mode, and with sanitizers using clang. Revealed a couple of minor issues, which were not affecting the processing. Fixed.
      • This was used to understand a long-pending issue with random floating point exceptions when running TPC tracking on the CPU. The problem is compiler auto-vectorization, which uses only 2 of the 4 SSE lanes; the remaining lanes compute on random data, which can trigger an FPE. ("Random" means that the compiler even initializes the data to 0, but that eventually leads to computation of sqrt(-inf).) It seems there is no way to fix this, so as a workaround I will by default disable FPE traps in the standalone benchmark when compiling with -ffast-math. (In the code I can only check for -ffast-math, not for -ftree-vectorize, which would be the correct check; but checking for fast-math might be enough, since it makes the problem much more likely.)
      • GCC bumped to 14.2
      • Once we bump arrow (PR opened by Giulio), we can bump LLVM to 19.
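The FPE issue above can be illustrated in plain C++ (a minimal sketch; `invalid_raised_by` is a hypothetical helper, not O2 code): computing sqrt of a negative value raises the FE_INVALID flag, which only becomes a crash if FPE traps are enabled - which is why disabling traps under -ffast-math is a viable workaround.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Sketch: sqrt of a negative value sets the FE_INVALID flag per IEEE 754.
// With -ffast-math plus auto-vectorization, the compiler may evaluate sqrt
// on unused SIMD lanes holding arbitrary data, raising the same flag
// spuriously; if traps are enabled (e.g. via glibc's feenableexcept), that
// flag turns into a SIGFPE. The volatile reads keep the compiler from
// constant-folding the computation away.
bool invalid_raised_by(double x) {
  std::feclearexcept(FE_ALL_EXCEPT);
  volatile double v = x;
  volatile double r = std::sqrt(v);
  (void)r;
  return std::fetestexcept(FE_INVALID) != 0;
}
```

Without traps enabled the flag is merely recorded, so the benchmark can keep running; the stray flag from dead SIMD lanes is harmless.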

       

      TPC / GPU Processing 

      • WIP: Use alignas() or find a better solution to fix alignment of monte carlo labels: https://its.cern.ch/jira/browse/O2-5314
      • Waiting for TPC to fix bogus TPC transformations for good, then we can revert the workaround.
      • Waiting for TPC to check PR which uses full cluster errors including average charge and occupancy map errors during seeding.
      • Final solution: merging transformation maps on the fly into a single flat object: Still WIP
      • Pending OpenCL2 issues:
        • printf not working due to confirmed bug in clang, fix is being prepared. Prevents further debugging for now.
        • Crash in merger, which can be worked around by disabling clang SPIRV optimization. Probably bug in clang, but need to fix printf first to debug.
        • Also with optimization disabled, crashing later in TPC merging, need printf to debug.
        • Felix debugged the OpenCL clusterization problem to be due to off-by-one offset in NoiseSuppression. Need to check how that can happen only in OpenCL.
      • Next high priority topic: Improvements for cluster sharing and cluster attachment at lower TPC pad rows.
      • Improved code-generation and application of DETERMINISTIC mode flags, such that GPU RTC can enable the deterministic mode and no_fast_math flags in the code it compiles. Now also yields 100% same results as the CPU in deterministic mode.
        • However, it was originally planned to make deterministic mode a runtime flag if RTC is used. Turns out this won't work, since configuration parameters might be rounded when scaled on the host before being passed on to the GPU. I.e., it is unavoidable to recompile the host code in deterministic mode.
      • Switched from thrust library to CUB library for sorting using the full GPU device. Thrust was adding unnecessary synchronizations. I patched them away in CUDA's thrust, but never had time to do the same in HIP thrust, and my CUDA patch didn't work any more with the latest CUDA. So switching to CUB seems the simplest solution. Time per TF reduced from 4.1 to 4.0 seconds on my NVIDIA GPU, unfortunately no improvements on MI50/MI100.
      • With RTC able to compile with NO_FAST_MATH, working to fix the issue that some clusters fail track-model decoding if RTC enabled due to floating point rounding.
        • We had fixed this without RTC using per-kernel compilation and per-kernel compile flags, but so far this was not possible with RTC.
      • Next for RTC: Ability to use different optimization parameters per compile, so we can use different optimized settings for MI50 and MI100, and we should be able to load settings at runtime for RTC, quite helpful for debugging and tuning, particularly for Gabriele.
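The alignment issue tracked in O2-5314 above can be sketched in a few lines (purely illustrative; `MCLabelChunk` and the 64-byte figure are assumptions, not the actual O2 label type): alignas() on the element type guarantees that label buffers start on an aligned boundary regardless of what precedes them in a flat memory block.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the alignas() approach for Monte Carlo label
// buffers: forcing the element type to a fixed alignment (here 64 bytes,
// a typical cache-line / memory-transaction size) means any array of these
// elements is correctly aligned, and sizeof is padded so that every array
// element stays aligned too.
struct alignas(64) MCLabelChunk {
  int event;
  int track;
  unsigned char payload[56];  // pad the struct to fill the aligned slot
};

static_assert(alignof(MCLabelChunk) == 64, "labels must be 64-byte aligned");
static_assert(sizeof(MCLabelChunk) % 64 == 0, "array elements stay aligned");
```

The static_asserts make the requirement self-documenting: a later change that breaks the alignment fails at compile time rather than at runtime on the GPU.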

       

      Other Topics

      • Quest position: 2 interviews yesterday, 4 more interviews today; will decide on the candidate by the end of the week.

       

      EPN major topics:

      • Fast movement of nodes between async / online without EPN expert intervention.
        • 2 goals I would like to set for the final solution:
          • It should not be needed to stop the SLURM schedulers when moving nodes, there should be no limitation for ongoing runs at P2 and ongoing async jobs.
          • We must not lose which nodes are marked as bad while moving.
      • Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
        • Lubos to provide interface to query current EPN SHM settings - ETA July 2023. Status?
      • Improve DataDistribution file replay performance: currently it cannot go faster than 0.8 Hz, so we cannot test the MI100 EPN in Pb-Pb at nominal rate, and cannot test the pp workflow for 100 EPNs in FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
      • DataDistribution distributes data round-robin in absence of backpressure, but it would be better to do it based on buffer utilization, and give more data to MI100 nodes. Now, we are driving the MI50 nodes at 100% capacity with backpressure, and then only backpressured TFs go on MI100 nodes. This increases the memory pressure on the MI50 nodes, which is anyway a critical point. https://alice.its.cern.ch/jira/browse/EPN-397
      • TfBuilders should stop in ERROR when they lose connection.
      • Allow epn user and grid user to set nice level of processes: https://its.cern.ch/jira/browse/EPN-349
      • Updated ODC again, should now work with ALMA 9.5 on the infra nodes.
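The buffer-utilization-based dispatch requested in EPN-397 above could look roughly like the following (a sketch under stated assumptions: `Node` and `pickNode` are illustrative names, not DataDistribution APIs): instead of round-robin, each TF goes to the node with the largest free-buffer fraction, so faster MI100 nodes naturally drain their queues and receive more data than saturated MI50 nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative model of a TfBuilder node's SHM buffer state.
struct Node {
  std::size_t bufferTotal;  // SHM buffer size in bytes
  std::size_t bufferUsed;   // bytes currently queued on this node
};

// Pick the node with the largest free-buffer fraction. Under load, nodes
// that process faster free their buffers sooner and are chosen more often,
// which is the desired weighting toward MI100 nodes.
std::size_t pickNode(const std::vector<Node>& nodes) {
  std::size_t best = 0;
  double bestFree = -1.0;
  for (std::size_t i = 0; i < nodes.size(); ++i) {
    double freeFrac =
        1.0 - double(nodes[i].bufferUsed) / double(nodes[i].bufferTotal);
    if (freeFrac > bestFree) {
      bestFree = freeFrac;
      best = i;
    }
  }
  return best;
}
```

Compared to round-robin, this also relieves the memory pressure on the MI50 nodes, since a node close to its SHM limit stops receiving new TFs automatically.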

       

      Other EPN topics:

       

    • 10:20 10:25
      Following up JIRA tickets 5m
      Speaker: Ernst Hellbar (CERN)
    • 10:25 10:30
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Since last week

      • As per request, added raw efficiency, clone- and fake-rate plots to QA output
      • New PR with NN implementation: https://github.com/AliceO2Group/AliceO2/pull/14069
        • CCDB interface (tested with test-ccdb and working)
        • Cosmetic changes
      • Google sheet to keep track of uploaded networks: https://docs.google.com/spreadsheets/d/1BGgDFqKnvYLlCK05hn5paQsDaiiE5HrwCErekvDqTv4/edit?gid=976920847#gid=976920847
        • Added commands for upload / deletion of networks based on CCDB metadata flags
        • Added link to test-ccdb that shows the network

       

    • 10:30 10:35
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
      ITS GPU tracking
      • General priorities:
        • Providing GPU ITS tracking alongside TPC via gpu-reco-workflow in the full system test: few changes in the fst script, but it does not work yet, due to different gpu-wf configurations w.r.t. my custom script. Investigating.
        • Focusing on porting all of what is possible on the device, extending the state of the art, and minimising computing on the host.
          • Tracking fully ported on GPU.
          • Moving vertexing routines to the externally managed memory system. -> WIP
            • Currently optimising the o2::its::timeframeGPU initialisation GPU transfers, trying to maximise data reuse and anticipating loads where possible.
      • Optimizations:
        • Asynchronous parallelisation in the tracklet finding, i.e. multi-streaming for obvious parallelisations.
        • Intelligent scheduling and multi-streaming can happen right after.
        • Kernel-level optimisations to be investigated.


      TODO:

        • Thrust allocator with external memory management -> possibly the most critical missing piece; need to find a decent way of introducing it.
        • Reproducer for HIP bug on multi-threaded track fitting: no follow-up yet.
        • Fix possible execution issues and known discrepancies when using gpu-reco-workflow: no progress.
      DCAFitterGPU
      • Deterministic approach via SMatrixGPU on the host, under a particular configuration: no progress.

       

    • 10:35 10:45
      TPC Track Model Decoding on GPU 10m
      Speaker: Gabriele Cimador (Universita e INFN Torino (TO))

      Some problems

      • Tried to plug in new optimum parameters for 6 different kernels after grid search
      • Parameters should be optimal for 50kHz IR
      • Turns out sync time is almost 7% slower with the new parameters
      • In grid search, kernels measured individually using rocprof
      • In overall evaluation, total sync and async time considered
      • Maybe there are hidden slowdowns which have not been measured during the grid search?
      • I will measure kernel durations to see which one is slower w.r.t. data taken during the grid search
      • If so, try individual grid searches and see if the optimal parameters change
      • If needed, try to find another way to measure kernel durations instead of using the ROCm profiler

      Idea for a more efficient grid search:

      1. Apply Latin Hypercube Sampling:
        1. Divide each dimension of the search space into M intervals (bins)
        2. Sample N points s.t. each interval (bin) contains exactly one sample point
        3. This way the search space is explored evenly
      2. Select the configuration with the best result
      3. Recursively apply LHS in a finer search space around that sample
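The sampling step above can be sketched as follows (illustrative code, not tied to the actual kernel-parameter search; `latinHypercube` is a hypothetical name): each dimension's N bin indices are shuffled independently, so every bin of every dimension is hit exactly once across the N samples.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Latin Hypercube Sampling over a D-dimensional discretized search space:
// returns nSamples points, each a vector of bin indices (one per dimension).
// Per dimension, the bin indices 0..nSamples-1 are used exactly once, so
// the projections onto every axis are evenly spread.
std::vector<std::vector<int>> latinHypercube(int dims, int nSamples,
                                             std::mt19937& rng) {
  std::vector<std::vector<int>> samples(nSamples, std::vector<int>(dims));
  for (int d = 0; d < dims; ++d) {
    std::vector<int> bins(nSamples);
    for (int i = 0; i < nSamples; ++i) bins[i] = i;  // one sample per bin
    std::shuffle(bins.begin(), bins.end(), rng);     // random pairing of bins
    for (int i = 0; i < nSamples; ++i) samples[i][d] = bins[i];
  }
  return samples;
}
```

The recursive refinement would then map the best sample's bin back to a parameter interval, subdivide that interval into M new bins per dimension, and call the sampler again on the narrowed space.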

      Important question

      Should I try this type of optimisation, or should I just apply a known external optimisation framework and somehow adapt it to this problem?

      Main concern

      • SliceTracker step has 8 main kernels with 16 parameters --> a 16-dimensional search space (vs. the 2 dimensions of the independent kernels)
      • Compiling the standalone benchmark to evaluate one sample point in the search space takes ~15 minutes each time
      • Will this heuristic be enough to achieve a feasible runtime for the search?
      • External frameworks might be faster?

      Other questions

      • In GPUDefGPUParameters.h, kernel parameters are defined like this:
        • #define GPUCA_LB_GPUTPCCompressionKernels_step0attached 192, 2 where 192 is the block_size and the grid_size is 2 * the available compute units.
          How can I define a grid_size which is not a multiple of the compute units?
      • How is the compression mode selected? When doing the grid search, I get this result from step 0 of the compression kernels:

        The problem arises because the default configuration, i.e. the dashed cell, is 1.3x slower than the same configuration outside the grid search. This problem arises only for this kernel. I suspect that the compression mode is changed, could that be the case? For comparison, with another kernel:

        Here the default configuration is correctly around 1
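The launch-bounds question above can be made concrete with a small sketch (an assumption about how such "block_size, multiplier" pairs are commonly interpreted, not verified O2 logic): the second number is a multiplier on the compute-unit count, so any grid size that is not a multiple of the CU count cannot be expressed in this scheme and would need a separate absolute-size parameter.

```cpp
#include <cassert>

// Interpretation of a launch-bounds pair like "192, 2": 192 threads per
// block, and a grid of (multiplier * number of compute units) blocks, so
// each CU is covered the same integer number of times.
constexpr int gridFromMultiplier(int cuMultiplier, int nComputeUnits) {
  return cuMultiplier * nComputeUnits;
}

// A hypothetical extension: treat non-positive multipliers as a flag and
// carry an absolute grid size instead, decoupled from the CU count.
constexpr int gridSizeAbsolute(int absoluteBlocks) { return absoluteBlocks; }

// e.g. multiplier 2 on a 60-CU GPU yields a 120-block grid:
static_assert(gridFromMultiplier(2, 60) == 120, "grid covers each CU twice");
```

This is only meant to frame the question: exposing an absolute grid size alongside the multiplier form would let the search explore grid sizes that are not multiples of the CU count.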
    • 10:45 10:55
      Efficient Data Structures 10m
      Speaker: Dr Oliver Gregor Rietmann (CERN)