ALICE Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY
Europe/Zurich

10:00 → 10:20   Discussion (20m)
Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
10:20 → 10:25   Following up JIRA tickets (5m)
Speaker: Ernst Hellbar (CERN)
Low-priority framework issues https://its.cern.ch/jira/browse/O2-5226
- Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead to obtain the pure payload rate; low priority.
- Merged workflow fails if outputs defined after being used as input
- needs to be implemented by Giulio
- Cannot override options for individual processors in a workflow
- requires development by Giulio first
- Problem with 2 devices of the same name
- Usage of valgrind in external terminal: the test case currently causes a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- Run getting stuck when too many TFs are in flight.
- Do not use string comparisons to derive the processor type, since DeviceSpec.name is user-defined.
- Support in DPL GUI to send individual START and STOP commands.
- Add an additional check at DPL level to make sure the firstOrbit received from all detectors is identical when creating the TimeFrame first orbit (see the sketch after this list).
- Implement a proper solution to detect whether a device is firstInChain
- Deploy topology with DPL driver
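A minimal sketch of what such a firstOrbit consistency check could look like (Python illustration only; the function name and error handling are assumptions, not the actual DPL code):

```python
# Sketch only (not the actual DPL implementation): check that the firstOrbit
# reported by every detector agrees before using it as the TimeFrame first orbit.
def timeframe_first_orbit(first_orbits: dict[str, int]) -> int:
    """first_orbits maps detector name -> firstOrbit received from that detector."""
    unique = set(first_orbits.values())
    if len(unique) != 1:
        raise ValueError(f"inconsistent firstOrbit across detectors: {first_orbits}")
    return unique.pop()

print(timeframe_first_orbit({"TPC": 123456, "ITS": 123456, "FT0": 123456}))
```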
PDP-SRC issues
- Check if we can remove dependencies on /home/epn/odc/files in DPL workflows to remove the dependency on the NFS
  - reading / writing already disabled
  - remaining checks for file existence?
  - check after Pb-Pb by removing the files and finding remaining dependencies
- logWatcher.sh and logFetcher scripts modified by EPN to remove dependencies on the epnlog user
  - node access privileges fully determined by e-groups
  - new log_access role to allow access in logWatcher mode to retrieve log files, e.g. for on-call shifters
    - to be validated on STG
  - waiting for EPN for further feedback and modifications of the test setup
- computing estimate for 2024 Pb-Pb
  - originally assumed 305 EPNs sufficient, but needed 340 EPNs (5 % margin) in the end
    - 11 % difference
  - estimate from 2023 Pb-Pb replay data with 2024 software
    - average hadronic interaction rate of Pb-Pb replay timeframes with pile-up correction for the ZDC rate
    - formula: IR_had = -ln(1 - rate_ZDC / (11245 * nbc)) * 11245 * nbc * 7.67 / 214.5 (see the calculation sketch after this list)
    - 2023, 544490, nbc = 1088, rate_ZNC = 1166153.4 Hz: IR_had = 43822.164 Hz
    - 2024, 560161, nbc = 1032, rate_ZNC = 1278045.2 Hz: IR_had = 48417.767 Hz
    - 10.5 % difference in IR from the 2023 replay to the 2024 replay
    - 7 % difference with respect to the 47 kHz assumed for the 2023 replay data (?) when estimating the required resources
  - could at least explain part of the difference between the estimated and observed margins
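For reference, a minimal Python sketch (not part of the original notes) that re-evaluates the IR formula above for the two replay datasets, using only the constants quoted in the minutes:

```python
# Hedged sketch: re-evaluating the pile-up-corrected hadronic interaction rate
# formula quoted above for the two replay datasets.
import math

def ir_had(rate_zdc_hz: float, nbc: int) -> float:
    """Hadronic IR from the ZDC trigger rate with pile-up correction.

    11245 Hz is the LHC revolution frequency, nbc the number of colliding
    bunches; the factor 7.67 / 214.5 is taken verbatim from the formula above.
    """
    mu = -math.log(1.0 - rate_zdc_hz / (11245.0 * nbc))  # average collisions per bunch crossing
    return mu * 11245.0 * nbc * 7.67 / 214.5

print(ir_had(1166153.4, 1088))  # 2023 replay -> ~43.8 kHz
print(ir_had(1278045.2, 1032))  # 2024 replay -> ~48.4 kHz
```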
- environment creation
  - https://its.cern.ch/jira/browse/O2-5629
  - cached topologies
    - in practice, only works when selecting only one detector or when defining the Detector list (Global) explicitly in the EPN ECS panel
      - when using default, the list of detectors is taken from default variables in ECS
        - not yet clear where this is set; it obviously depends on the selected detectors
      - the order of detectors is always different, even for identical environments; therefore the topology hash is also different and the cached topologies are not used (see the small illustration below)
    - investigating together with the ECS team
      - fix in Controls and ECS to provide an alphabetically ordered detector list
      - topology hashes are now identical for identical environments
      - speed-up (in STG): from 22 s the first time the topology generation scripts are run down to 5 s the second time, using the cached topology
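A small illustration (Python, not the actual ECS/ODC code) of why an unordered detector list defeats the topology cache and why an alphabetically ordered list fixes it; the hash function here is a hypothetical stand-in for the real topology hash:

```python
# Illustration only: an order-dependent hash over the detector list means two
# identical environments can produce different topology hashes -> cache miss.
import hashlib

def topology_hash(detectors: list[str]) -> str:
    # Hypothetical stand-in for the real hash over the generated topology.
    return hashlib.sha256(",".join(detectors).encode()).hexdigest()[:12]

env_a = ["TPC", "ITS", "FT0"]
env_b = ["ITS", "FT0", "TPC"]  # same detectors, different order

print(topology_hash(env_a) == topology_hash(env_b))                  # False -> cache miss
print(topology_hash(sorted(env_a)) == topology_hash(sorted(env_b)))  # True  -> cache hit
```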
  - start-up time
    - ~80 s spent in state transitions from IDLE to READY
      - will profile state transitions with export DPL_SIGNPOSTS=device to determine if we wait for single slow tasks or if some other part (e.g. DDS) is slow
    - Summary of time spent in state transitions
      - Text file with summary information: /home/ehellbar/env_creation_profiling/profile_transitions_2rYE2tBcysz/2rYE2tBcysz/transition_times_sorted.txt
      - Starting FairMQ state machine to IDLE
        - total of 35 s
        - devices are started one by one, so timestamps increase device by device
        - time between the last start of a device and the first initialization of a device is 15 s
        - so DDS spends 20 s sending the start-up command to all the tasks (see the arithmetic sketch after this list)
      - IDLE to INITIALIZED
        - 25 s for the GPU RTC, all tasks waiting for it to finish
        - in the shadow of the GPU RTC, the QC tasks themselves take up to 15 s in the Init callback to initialize the CcdbApi
      - DEVICE READY to READY
        - total of 10 s (number obtained from InfoLogger messages)
        - 6 - 9 s for shm mapping in gpu-reconstruction
    - DDS waits for all tasks to complete a transition at the following steps (excluding steps where effectively 0 time is spent)
      - until IDLE
      - until INITIALIZED
      - until READY
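As a sanity check, a tiny sketch (using only the numbers quoted above) of how the 20 s attributed to DDS follows from the 35 s total transition time and the 15 s overlap:

```python
# Sketch only, with the figures from the minutes: the DDS dispatch time is the
# total "Starting FairMQ state machine -> IDLE" duration minus the 15 s between
# the last device start and the first device initialization.
total_start_to_idle_s = 35.0       # total duration of the transition
last_start_to_first_init_s = 15.0  # interval in which no further start commands are sent

dds_dispatch_s = total_start_to_idle_s - last_start_to_first_init_s
print(f"DDS spends ~{dds_dispatch_s:.0f} s sending start-up commands to all tasks")
```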
10:25 → 10:30   TPC ML Clustering (5m)
Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Items on agenda:
- Started writing the thesis: https://www.overleaf.com/read/hcvqgpxnjqnz#2cd750
- Simulations for different IRs done, always 50 events each. Using these to evaluate NN performance for different IRs:
- PbPb: 10, 30, 50 kHz
- pp: 100, 500, 1000, 1500 kHz
- Current developments:
  - dE/dx tuning:
    - Obtained the workflow from Jens. The calibration object is being produced and loaded, but no effect on tpcptracks.root observed yet. Investigating.
  - Lambda / K0S reconstruction efficiency:
    - Obtained the workflow from Ruben
    - Asked Sandro if it is possible to inject K0S and Lambda into the FST simulation (basically add the digits on top, i.e. merge the two simulations): technically possible, only needs minor development in the simulation framework
Study of NN input size:
- Used the currently existing training data to extract different "input grids" (row x pad x time): (1x5x5), (1x7x7), (3x7x7), (5x7x7) and the current reference case (7x7x7); see the windowing sketch after this list
- Used 7x7x7 for classification and only compared the effect of changing the input to the regression network
- Observations:
  - Smooth behaviour for the 2D cases (i.e. 1x5x5 and 1x7x7) at sector boundaries, as expected
  - No reliable momentum vector estimate for 2D, as expected (since 3D charge information is needed)
  - 3x7x7 vs 7x7x7: the Phi estimate (= width of the Phi distribution) worsens by around 10% across pT, the Theta estimate by around 20%. But both distributions are well centered across pT in both cases.
  - qTot and dE/dx performance is basically as good for 2D as for 3D input. The sigma estimation actually slightly improves for the 2D case.
  - 1x5x5 vs 1x7x7 (and also 3D): the width of the CoG-time estimate at inner radii improves with a larger pad-time window (left 1x5x5, right 5x7x7). The CoG-pad estimate stays very similar.
  - Tracking efficiency at very low pT improves for 3D input, but is overall almost identical between the 2D and 3D cases
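A minimal windowing sketch (illustration only, not the actual training-data pipeline; array shapes and names are assumptions) of how a (row x pad x time) input grid can be sliced around a cluster-center candidate:

```python
# Illustration only: slicing a (row x pad x time) window around a cluster-center
# candidate from a dense charge map. Shapes, names and the zero-padding strategy
# are assumptions, not the actual O2 / training code.
import numpy as np

def input_grid(charges: np.ndarray, row: int, pad: int, time: int,
               shape: tuple[int, int, int] = (7, 7, 7)) -> np.ndarray:
    """Return a shape-sized window centred on (row, pad, time).

    charges: dense 3D array indexed as [row, pad, time].
    Out-of-range bins (e.g. at sector boundaries) are zero-padded.
    """
    half = [s // 2 for s in shape]
    padded = np.pad(charges, [(h, h) for h in half])  # zero-pad the borders
    r, p, t = row + half[0], pad + half[1], time + half[2]
    return padded[r - half[0]: r + half[0] + 1,
                  p - half[1]: p + half[1] + 1,
                  t - half[2]: t + half[2] + 1]

# Toy example: a 1x5x5 (2D) and a 7x7x7 (3D) window around the same candidate.
charges = np.random.default_rng(0).random((20, 20, 40)).astype(np.float32)
print(input_grid(charges, 10, 10, 20, (1, 5, 5)).shape)  # (1, 5, 5)
print(input_grid(charges, 10, 10, 20, (7, 7, 7)).shape)  # (7, 7, 7)
```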
Next steps:
- Small things: update the macro with axis-label plotting and add a comparison macro for all plots (to plot TGraphs from two QA outputs into one plot)
- Giulio: Get the alidist recipe working fully
- Repeat studies on all interaction rates
- Above mentioned: dE/dx calibration + Lambda-K0S analysis
10:30 → 10:35   ITS Tracking (5m)
Speaker: Matteo Concas (CERN)
ITS GPU tracking
- General priorities:
  - Focusing on porting all of what is possible to the device, extending the state of the art, and minimising computing on the host.
  - Optimisations via intelligent scheduling and multi-streaming can happen right after.
  - Kernel-level optimisations to be investigated.
- Move the remaining tricky track-finding steps to GPU
  - The ProcessNeighbours kernel has been ported and validated.
  - Now fixing the logic that concatenates its usage multiple times, as it is still faulty -> still WIP
- TODO:
  - Reproducer for the HIP bug in multi-threaded track fitting: no progress yet.
  - Fix possible execution issues and known discrepancies when using gpu-reco-workflow: no progress yet; will start after the tracklet finding is ported.

DCAFitterGPU

- Deterministic approach using SMatrixGPU on the host, under a particular configuration: no progress yet.
10:35 → 10:45   TPC Track Model Decoding on GPU (10m)
Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
10:45 → 10:55   Efficient Data Structures (10m)
Speaker: Dr Oliver Gregor Rietmann (CERN)