Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY

Timezone: Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 AM - 10:20 AM
      Discussion 20m
      Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
    • 10:20 AM - 10:25 AM
      Following up JIRA tickets 5m
      Speaker: Ernst Hellbar (CERN)
      Low-priority framework issues https://its.cern.ch/jira/browse/O2-5226
      • Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead to show the pure payload rate (low priority).
      • Merged workflow fails if outputs defined after being used as input
        • needs to be implemented by Giulio
      • Cannot override options for individual processors in a workflow
        • requires development by Giulio first 
      • Problem with 2 devices of the same name
      • Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • Run getting stuck when too many TFs are in flight.
      • Do not use string comparisons to derive the processor type, since DeviceSpec.name is user-defined.
      • Support in DPL GUI to send individual START and STOP commands.
      • Add an additional check on DPL level to make sure the firstOrbit received from all detectors is identical when creating the TimeFrame first orbit (a sketch of such a check follows this list).
      • Implement a proper solution to detect whether a device is firstInChain
      • Deploy topology with DPL driver
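
      A minimal sketch of the kind of firstOrbit cross-check meant above; the helper name and the container holding the per-detector values are illustrative, not the actual DPL interface:

      #include <cstdint>
      #include <iostream>
      #include <map>
      #include <string>

      // Hypothetical helper: given the firstOrbit reported by each detector,
      // check that all values agree before the TimeFrame first orbit is set.
      bool firstOrbitConsistent(const std::map<std::string, std::uint32_t>& firstOrbitPerDetector)
      {
        if (firstOrbitPerDetector.empty()) {
          return true;
        }
        const std::uint32_t reference = firstOrbitPerDetector.begin()->second;
        bool consistent = true;
        for (const auto& [detector, firstOrbit] : firstOrbitPerDetector) {
          if (firstOrbit != reference) {
            std::cerr << "firstOrbit mismatch: " << detector << " reports " << firstOrbit
                      << ", expected " << reference << "\n";
            consistent = false;
          }
        }
        return consistent;
      }

      int main()
      {
        // Example input: ITS disagrees with TPC and TOF.
        std::map<std::string, std::uint32_t> firstOrbits{{"TPC", 1024}, {"ITS", 1056}, {"TOF", 1024}};
        std::cout << (firstOrbitConsistent(firstOrbits) ? "consistent" : "inconsistent") << "\n";
      }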

       

      PDP-SRC issues
      • Check if we can remove dependencies on /home/epn/odc/files in DPL workflows to remove the dependency on the NFS
        • reading / writing already disabled
        • remaining checks for file existence?
        • check after Pb-Pb by removing files and find remaining dependencies
      • logWatcher.sh and logFetcher scripts modified by EPN to remove dependencies on epnlog user
        • node access privileges fully determined by e-groups
        • new log_access role to allow access in logWatcher mode to retrieve log files, e.g. for on-call shifters
        • to be validated on STG
        • waiting for EPN for further feedback and modifications of the test setup
      • computing estimate for 2024 Pb-Pb
        • originally assumed 305 EPNs sufficient, but needed 340 EPNs (5 % margin) in the end
          • 11 % difference 
          • estimate from 2023 Pb-Pb replay data with 2024 software
        • average hadronic interaction rate of Pb-Pb replay timeframes with pile-up correction for ZDC rate
          • formula: IR_had = -ln(1 - rate_ZDC / (11245*nbc)) * 11245 * nbc * 7.67 / 214.5 (a numerical check is sketched after this list)
          • 2023, run 544490, nbc=1088, rate_ZNC=1166153.4 Hz: IR_had = 43822.164 Hz
          • 2024, run 560161, nbc=1032, rate_ZNC=1278045.2 Hz: IR_had = 48417.767 Hz
          • 10.5 % difference in IR from the 2023 replay to the 2024 replay
          • a 7 % scale-up to 47 kHz was assumed for the 2023 replay data (?) when estimating the required resources
              • could at least explain part of the difference between the estimated and observed margins
      • environment creation
        • cached topologies
          • in practice, this only works when selecting a single detector or when explicitly defining the Detector list (Global) in the EPN ECS panel
            • when using the default, the list of detectors is taken from default variables in ECS
              • not yet clear where this is set; it obviously depends on the selected detectors
              • the order of detectors is always different, even for identical environments, so the topology hash also differs and the cached topologies are not used
              • investigating together with ECS team
        • start-up time
          • ~80 sec spent in state transitions from IDLE to READY 
          • will profile state transitions with export DPL_SIGNPOSTS=device to determine whether we wait for single slow tasks or whether some other part (e.g. DDS) is slow
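
      For reference, a minimal numerical sketch of the interaction-rate formula quoted above, reproducing the 2023 and 2024 values; the ZNC rates from the minutes are used as input, and reading 7.67 and 214.5 as the hadronic and ZNC cross sections in barn is an assumption:

      #include <cmath>
      #include <cstdio>

      // Pile-up-corrected hadronic interaction rate from the measured ZNC rate:
      //   IR_had = -ln(1 - rate_ZNC / (11245 * nbc)) * 11245 * nbc * 7.67 / 214.5
      // 11245 Hz is the LHC revolution frequency and nbc the number of colliding bunches;
      // 7.67 / 214.5 is assumed to be the ratio of the hadronic to the ZNC cross section.
      double hadronicIR(double rateZNC, int nbc)
      {
        const double fRev = 11245.0;              // LHC revolution frequency [Hz]
        const double mu = rateZNC / (fRev * nbc); // average ZNC counts per bunch crossing
        return -std::log(1.0 - mu) * fRev * nbc * 7.67 / 214.5;
      }

      int main()
      {
        // Values quoted in the minutes for the 2023 and 2024 Pb-Pb replay data.
        std::printf("2023 (run 544490): IR_had = %.1f Hz\n", hadronicIR(1166153.4, 1088)); // ~43822 Hz
        std::printf("2024 (run 560161): IR_had = %.1f Hz\n", hadronicIR(1278045.2, 1032)); // ~48418 Hz
      }
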
    • 10:25 AM - 10:30 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
    • 10:30 AM - 10:35 AM
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
       
      ITS GPU tracking
      • General priorities:
        • Focusing on porting everything that is possible to the device, extending the state of the art, and minimising computation on the host.
        • Optimisations via intelligent scheduling and multi-streaming can follow right after.
        • Kernel-level optimisations to be investigated.
      • Move remaining track-finding tricky steps on GPU
        • ProcessNeighbours kernel has been ported and validated.
        • Now fixing the logic that chains multiple invocations of it, which is still faulty.
      • TODO:
        • Reproducer for HIP bug on multi-threaded track fitting: no progress yet.
        • Fix possible execution issues and known discrepancies when using gpu-reco-workflow: no progress yet; will start after the tracklet finding is ported.
      DCAFitterGPU
      • Deterministic approach using SMatrixGPU on the host under a particular configuration: no progress yet.
    • 10:35 AM - 10:45 AM
      TPC Track Model Decoding on GPU 10m
      Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
    • 10:45 AM - 10:55 AM
      Efficient Data Structures 10m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      Efficient Data Structures

      In order to onboard Jolly (a collaborator from ROOT) onto the code more easily, I did the following:

      • Added unit tests
      • Added diagnostic functions (e.g. to count the number of constructor calls, copies, moves, ...); an illustrative sketch follows this list.
      • Set up easy benchmarks (still ongoing).
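
      As an illustration of what such diagnostic functions do, a generic counting type (a sketch, not the actual project code) could look like this:

      #include <cstddef>
      #include <iostream>
      #include <vector>

      // Illustrative diagnostic type: counts how often each special member
      // function is invoked, e.g. to verify that a refactoring of the data
      // structures does not introduce unintended copies.
      struct Counted {
        static inline std::size_t constructions = 0;
        static inline std::size_t copies = 0;
        static inline std::size_t moves = 0;

        Counted() { ++constructions; }
        Counted(const Counted&) { ++copies; }
        Counted(Counted&&) noexcept { ++moves; }
        Counted& operator=(const Counted&) { ++copies; return *this; }
        Counted& operator=(Counted&&) noexcept { ++moves; return *this; }

        static void report()
        {
          std::cout << "constructions: " << constructions
                    << ", copies: " << copies
                    << ", moves: " << moves << "\n";
        }
      };

      int main()
      {
        std::vector<Counted> v;
        v.reserve(4);           // avoid reallocation noise in the counters
        v.emplace_back();       // 1 construction
        v.push_back(Counted{}); // 1 construction + 1 move
        Counted::report();      // prints: constructions: 2, copies: 0, moves: 1
      }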