Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY

Timezone: Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 AM - 10:20 AM
      Discussion 20m
      Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
    • 10:20 AM - 10:25 AM
      Following up JIRA tickets 5m
      Speaker: Ernst Hellbar (CERN)
      Low-priority framework issues https://its.cern.ch/jira/browse/O2-5226
      • Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to report the pure payload rate; low priority.
      • Merged workflow fails if outputs defined after being used as input
        • needs to be implemented by Giulio
      • Cannot override options for individual processors in a workflow
        • requires development by Giulio first 
      • Problem with 2 devices of the same name
      • Usage of valgrind in external terminal: The test case is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • Run getting stuck when too many TFs are in flight.
      • Do not use string comparisons to derive the processor type, since DeviceSpec.name is user-defined.
      • Support in DPL GUI to send individual START and STOP commands.
      • Add an additional check at the DPL level to make sure the firstOrbit received from all detectors is identical when creating the TimeFrame first orbit.
      • Implement a proper solution to detect whether a device is firstInChain
      • Deploy topology with DPL driver

       

      PDP-SRC issues
      • Check if we can remove dependencies on /home/epn/odc/files in DPL workflows to remove the dependency on the NFS
        • reading / writing already disabled
        • remaining checks for file existence?
        • check after Pb-Pb by removing the files and finding the remaining dependencies
      • logWatcher.sh and logFetcher scripts modified by EPN to remove dependencies on epnlog user
        • node access privileges fully determined by e-groups
        • new log_access role to allow access in logWatcher mode to retrieve log files, e.g. for on-call shifters
        • to be validated on STG
        • waiting for EPN for further feedback and modifications of the test setup
      • computing estimate for 2024 Pb-Pb
        • originally assumed 305 EPNs to be sufficient, but needed 340 EPNs (5 % margin) in the end
          • 11 % difference 
          • estimate from 2023 Pb-Pb replay data with 2024 software
        • average hadronic interaction rate of Pb-Pb replay timeframes with pile-up correction for ZDC rate
          • formula: IR_had = -ln(1 - rate_ZNC / (11245 * nbc)) * 11245 * nbc * 7.67 / 214.5 (a numerical cross-check is given at the end of this section)
          • 2023, run 544490, nbc=1088, rate_ZNC=1166153.4 Hz: IR_had = 43822.164 Hz
          • 2024, run 560161, nbc=1032, rate_ZNC=1278045.2 Hz: IR_had = 48417.767 Hz
          • 10.5 % difference in IR from the 2023 replay to the 2024 replay
          • 47 kHz (about 7 % above the computed 43.8 kHz) was assumed for the 2023 replay data (?) when estimating the required resources
              • could at least explain part of the difference between the estimated and observed margins
      • environment creation
        • https://its.cern.ch/jira/browse/O2-5629 
        • cached topologies
          • in practice, this only works when selecting a single detector or when defining the Detector list (Global) explicitly in the EPN ECS panel
            • when using the default, the list of detectors is taken from default variables in ECS
              • not yet clear where this is set; it obviously depends on the selected detectors
              • the order of detectors is always different, even for identical environments; therefore the topology hash is also different and the cached topologies are not used
              • investigating together with the ECS team
            • fix in Controls and ECS to provide an alphabetically ordered detector list (a minimal illustration of the order dependence follows at the end of this section)
              • topology hashes are now identical for identical environments
            • speed-up (on STG): topology generation goes from 22 s the first time the scripts are run to 5 s the second time, when the cached topology is used
        • start-up time
          • ~80 s spent in state transitions from IDLE to READY
          • will profile state transitions with export DPL_SIGNPOSTS=device to determine whether we wait for individual slow tasks or whether some other part (e.g. DDS) is slow
          • Summary of time spent in state transitions
            • Text file with summary information: /home/ehellbar/env_creation_profiling/profile_transitions_2rYE2tBcysz/2rYE2tBcysz/transition_times_sorted.txt
            • Starting FairMQ state machine to IDLE
              • total of 35 s
              • devices are started one by one, so timestamps are increasing device by device
              • time between the last start of a device and the first initialization of a device is 15 s
                • so DDS spends 20 s to send the start-up command to all the tasks
            • IDLE to INITIALIZED
              • 25 s for GPU RTC, all tasks waiting for it to finish
              • in the shadow of the GPU RTC, the QC tasks themselves take up to 15 s in the Init callback to initialize the CcdbApi
            • DEVICE READY to READY
              • total of 10 s (number obtained from InfoLogger messages)
              • 6-9 s for SHM mapping in gpu-reconstruction
          • DDS waiting for all tasks to complete a transition at the following steps (excluding steps where effectively 0 time is spent)
            • until IDLE
            • until INITIALIZED
            • until READY 
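
      For reference, a minimal numerical cross-check (plain Python) of the IR_had formula and of the EPN margin quoted above. The interpretation of 11245 Hz as the LHC revolution frequency and of 7.67 b / 214.5 b as the hadronic and ZNC cross sections is our reading and is not stated explicitly in the minutes:

        import math

        F_REV = 11245.0  # assumed: LHC revolution frequency [Hz]

        def ir_had(rate_znc, nbc):
            # Pile-up-corrected average number of ZNC interactions per bunch crossing.
            mu = -math.log(1.0 - rate_znc / (F_REV * nbc))
            # True ZNC rate, scaled to the hadronic rate by the cross-section ratio
            # quoted in the formula above (7.67 / 214.5; interpretation assumed).
            return mu * F_REV * nbc * 7.67 / 214.5

        print(ir_had(1166153.4, 1088))  # 2023 replay -> ~43822 Hz
        print(ir_had(1278045.2, 1032))  # 2024 replay -> ~48418 Hz
        print(ir_had(1278045.2, 1032) / ir_had(1166153.4, 1088) - 1)  # ~0.105 -> 10.5 % higher IR
        print(340 / 305 - 1)  # ~0.11 -> 11 % more EPNs needed than the original 305
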

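      As a minimal illustration of the topology-caching order dependence discussed above (a hypothetical sketch, not the actual AliECS/ODC code): if the cache key is derived from a hash of the detector list, any reordering of the same detectors yields a different key unless the list is canonicalized first, e.g. sorted alphabetically as in the ECS fix:

        import hashlib

        def topology_key(detectors):
            # Hypothetical cache key: hash over the detector list.
            # Sorting makes the key independent of the order in which ECS lists the detectors.
            canonical = ",".join(sorted(detectors))
            return hashlib.sha256(canonical.encode()).hexdigest()

        # Same environment, two different orderings -> identical key only thanks to the sort.
        print(topology_key(["TPC", "ITS", "FT0", "ZDC"]))
        print(topology_key(["ITS", "ZDC", "TPC", "FT0"]))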
       

    • 10:25 AM - 10:30 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

      Items on agenda:

      • Started writing the thesis: https://www.overleaf.com/read/hcvqgpxnjqnz#2cd750
      • Simulations for different IRs done, always 50 events. Using these to evaluate the NN performance for different IRs:
        • PbPb: 10, 30, 50 kHz
        • pp: 100, 500, 1000, 1500 kHz
      • Current developments:
        • dE/dx tuning:
          • Obtained workflow from Jens. The calibration object is being produced and loaded, but no effect on tpcptracks.root observed yet. Investigating.
        • Lambda / K0S reconstruction efficiency:
          • Obtained workflow from Ruben
          • Asked Sandro if it is possible to inject K0S and Lambda into the FST simulation (basically add the digits on top == merge the two simulations): technically possible, only needs minor development in the simulation framework

       

       Study of NN input size:

      • Used the currently existing training data to extract different "input grids" (row x pad x time): (1x5x5), (1x7x7), (3x7x7), (5x7x7) and the current reference case (7x7x7); a minimal slicing sketch is given after the observations below
      • Used 7x7x7 for classification and only compared the effect of changing the input to the regression network
      • Observations:
        • Smooth behaviour for 2D cases (i.e. 1x5x5 and 1x7x7) at sector boundaries, as expected
        • No reliable momentum vector estimate for 2D, as expected (since 3D charge information is needed)
        • 3x7x7 vs 7x7x7: the Phi estimate (= width of the Phi distribution) worsens by around 10% across pT, the Theta estimate by around 20%. But both distributions are well centered across pT in both cases.
        • qTot and dE/dx performance is basically equally good for 2D and 3D input. The sigma estimation actually improves slightly for the 2D case
        • 1x5x5 vs 1x7x7 (and also 3D): the width of the CoG-time estimate at inner radii improves with a larger pad-time window (left 1x5x5, right 5x7x7). The CoG-pad estimate stays very similar.

        • Tracking efficiency at very low pT improves for 3D input, but is overall almost identical between the 2D and 3D cases
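
      As a minimal sketch of what the input grids above correspond to (hypothetical array layout and names, not the actual training-data extraction code): a (rows x pads x time bins) window is cut around the cluster-candidate maximum, so 1x5x5 uses only the central pad row while 7x7x7 also includes the three neighbouring rows on each side:

        import numpy as np

        def input_grid(charge_map, row, pad, time, shape=(7, 7, 7)):
            # Cut a (rows x pads x time bins) window centred on a cluster candidate.
            # charge_map: 3D array of ADC values indexed as [row, pad, time] (assumed layout).
            # shape: e.g. (1, 5, 5), (3, 7, 7) or the reference (7, 7, 7).
            dr, dp, dt = (s // 2 for s in shape)
            return charge_map[row - dr:row + dr + 1,
                              pad - dp:pad + dp + 1,
                              time - dt:time + dt + 1]

        charges = np.random.rand(152, 140, 1000)  # toy charge map, stand-in for real digits
        center = (80, 70, 500)                    # hypothetical cluster-candidate position
        print(input_grid(charges, *center, shape=(1, 5, 5)).shape)  # 2D case: central row only
        print(input_grid(charges, *center, shape=(7, 7, 7)).shape)  # reference 3D case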

       

      Next steps:

      • Small things: update the macro to plot axis labels and add a comparison macro for all plots (to plot the TGraphs of two QA outputs in one plot)
      • Giulio: Get the alidist recipe working fully
      • Repeat studies on all interaction rates
      • As mentioned above: dE/dx calibration + Lambda-K0S analysis
    • 10:30 AM - 10:35 AM
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
      ITS GPU tracking
      • General priorities:
        • Focusing on porting everything possible to the device, extending the current state of the art, and minimising computing on the host.
        • Optimizations via intelligent scheduling and multi-streaming can happen right after.
        • Kernel-level optimisations to be investigated.
      • Move the remaining tricky track-finding steps to the GPU
        • ProcessNeighbours kernel has been ported and validated.
        • Now fixing the logic that concatenates its usage multiple times, as it is still faulty -> still WIP
      • TODO:
        • Reproducer for HIP bug on multi-threaded track fitting: no progress yet.
        • Fix possible execution issues and known discrepancies when using gpu-reco-workflow: no progress yet; will start after the tracklet finding is ported.
      DCAFitterGPU
      • Deterministic approach using SMatrixGPU on the host under a particular configuration: no progress yet.
    • 10:35 AM - 10:45 AM
      TPC Track Model Decoding on GPU 10m
      Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
    • 10:45 AM - 10:55 AM
      Efficient Data Structures 10m
      Speaker: Dr Oliver Gregor Rietmann (CERN)