Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY

Timezone: Europe/Zurich
Zoom Meeting ID: 61230224927
Host: David Rohr
    • 10:00 AM - 10:20 AM
      Discussion 20m
      Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
    • 10:20 AM - 10:25 AM
      Following up JIRA tickets 5m
      Speaker: Ernst Hellbar (CERN)
      Low-priority framework issues https://its.cern.ch/jira/browse/O2-5226
      • Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead to show the pure payload rate (low priority).
      • Merged workflow fails if outputs defined after being used as input
        • needs to be implemented by Giulio
      • Cannot override options for individual processors in a workflow
        • requires development by Giulio first 
      • Problem with 2 devices of the same name
      • Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
      • Run getting stuck when too many TFs are in flight.
      • Do not use string comparisons to derive the processor type, since DeviceSpec.name is user-defined.
      • Support in DPL GUI to send individual START and STOP commands.
      • Add an additional check on DPL level to make sure the firstOrbit received from all detectors is identical when creating the TimeFrame first orbit (a sketch of such a check follows this list).
      • Implement a proper solution to detect whether a device is firstInChain
      • Deploy topology with DPL driver
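
      A minimal sketch of the kind of firstOrbit cross-check meant above; the helper name and the container holding the per-detector values are illustrative, not the actual DPL interface:

      #include <cstdint>
      #include <iostream>
      #include <map>
      #include <string>

      // Hypothetical helper: given the firstOrbit reported by each detector,
      // check that all values agree before the TimeFrame first orbit is set.
      bool firstOrbitConsistent(const std::map<std::string, std::uint32_t>& firstOrbitPerDetector)
      {
        if (firstOrbitPerDetector.empty()) {
          return true;
        }
        const std::uint32_t reference = firstOrbitPerDetector.begin()->second;
        bool consistent = true;
        for (const auto& [detector, firstOrbit] : firstOrbitPerDetector) {
          if (firstOrbit != reference) {
            std::cerr << "firstOrbit mismatch: " << detector << " reports " << firstOrbit
                      << ", expected " << reference << "\n";
            consistent = false;
          }
        }
        return consistent;
      }

      int main()
      {
        // Example input: ITS disagrees with TPC and TOF.
        std::map<std::string, std::uint32_t> firstOrbits{{"TPC", 1024}, {"ITS", 1056}, {"TOF", 1024}};
        std::cout << (firstOrbitConsistent(firstOrbits) ? "consistent" : "inconsistent") << "\n";
      }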

       

      PDP-SRC issues
      • Check if we can remove dependencies on /home/epn/odc/files in DPL workflows to remove the dependency on the NFS
        • reading / writing already disabled
        • remaining checks for file existence?
        • check after Pb-Pb by removing files and find remaining dependencies
      • logWatcher.sh and logFetcher scripts modified by EPN to remove dependencies on epnlog user
        • node access privileges fully determined by e-groups
        • new log_access role to allow access in logWatcher mode to retrieve log files, e.g. for on-call shifters
        • to be validated on STG
        • waiting for EPN for further feedback and modifications of the test setup
      • computing estimate for 2024 Pb-Pb
        • originally assumed 305 EPNs sufficient, but needed 340 EPNs (5 % margin) in the end
          • 11 % difference 
          • estimate from 2023 Pb-Pb replay data with 2024 software
        • average hadronic interaction rate of Pb-Pb replay timeframes with pile-up correction for ZDC rate
          • formula: IR_had = -ln(1 - rate_ZDC / (11245*nbc)) * 11245 * nbc * 7.67 / 214.5 (a numerical check is sketched after this list)
          • 2023, run 544490, nbc=1088, rate_ZNC=1166153.4 Hz: IR_had = 43822.164 Hz
          • 2024, run 560161, nbc=1032, rate_ZNC=1278045.2 Hz: IR_had = 48417.767 Hz
          • 10.5 % difference in IR from the 2023 replay to the 2024 replay
          • a 7 % scale-up to 47 kHz was assumed for the 2023 replay data (?) when estimating the required resources
              • could at least explain part of the difference between the estimated and observed margins
      • environment creation
        • cached topologies
          • in practice, this only works when selecting a single detector or when explicitly defining the Detector list (Global) in the EPN ECS panel
            • when using the default, the list of detectors is taken from default variables in ECS
              • not yet clear where this is set; it obviously depends on the selected detectors
              • the order of detectors is always different, even for identical environments, so the topology hash also differs and the cached topologies are not used
              • investigating together with ECS team
        • start-up time
          • ~80 sec spent in state transitions from IDLE to READY 
          • will profile state transitions with export DPL_SIGNPOSTS=device to determine whether we wait for single slow tasks or whether some other part (e.g. DDS) is slow
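
      For reference, a minimal numerical sketch of the interaction-rate formula quoted above, reproducing the 2023 and 2024 values; the ZNC rates from the minutes are used as input, and reading 7.67 and 214.5 as the hadronic and ZNC cross sections in barn is an assumption:

      #include <cmath>
      #include <cstdio>

      // Pile-up-corrected hadronic interaction rate from the measured ZNC rate:
      //   IR_had = -ln(1 - rate_ZNC / (11245 * nbc)) * 11245 * nbc * 7.67 / 214.5
      // 11245 Hz is the LHC revolution frequency and nbc the number of colliding bunches;
      // 7.67 / 214.5 is assumed to be the ratio of the hadronic to the ZNC cross section.
      double hadronicIR(double rateZNC, int nbc)
      {
        const double fRev = 11245.0;              // LHC revolution frequency [Hz]
        const double mu = rateZNC / (fRev * nbc); // average ZNC counts per bunch crossing
        return -std::log(1.0 - mu) * fRev * nbc * 7.67 / 214.5;
      }

      int main()
      {
        // Values quoted in the minutes for the 2023 and 2024 Pb-Pb replay data.
        std::printf("2023 (run 544490): IR_had = %.1f Hz\n", hadronicIR(1166153.4, 1088)); // ~43822 Hz
        std::printf("2024 (run 560161): IR_had = %.1f Hz\n", hadronicIR(1278045.2, 1032)); // ~48418 Hz
      }
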
    • 10:25 AM - 10:30 AM
      TPC ML Clustering 5m
      Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
    • 10:30 AM - 10:35 AM
      ITS Tracking 5m
      Speaker: Matteo Concas (CERN)
       
      ITS GPU tracking
      • General priorities:
        • Focusing on porting everything that is possible to the device, extending the state of the art, and minimising computation on the host.
        • Optimisations via intelligent scheduling and multi-streaming can follow right after.
        • Kernel-level optimisations to be investigated.
      • Move remaining track-finding tricky steps on GPU
        • ProcessNeighbours kernel has been ported and validated.
        • Now fixing the logic that chains multiple invocations of it, which is still faulty.
      • TODO:
        • Reproducer for HIP bug on multi-threaded track fitting: no progress yet.
        • Fix possible execution issues and known discrepancies when using gpu-reco-workflow: no progress yet; will start after the tracklet finding is ported.
      DCAFitterGPU
      • Deterministic approach using SMatrixGPU on the host under a particular configuration: no progress yet.
    • 10:35 AM - 10:45 AM
      TPC Track Model Decoding on GPU 10m
      Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
    • 10:45 AM - 10:55 AM
      Efficient Data Structures 10m
      Speaker: Dr Oliver Gregor Rietmann (CERN)

      Efficient Data Structures

      In order to onboard Jolly (a collaborator from ROOT) onto the code more easily, I did the following:

      • Added unit tests
      • Added diagnostic functions (e.g. to count the number of constructor calls, copies, moves, ...); an illustrative sketch follows this list.
      • Set up easy benchmarks (still ongoing).
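
      As an illustration of what such diagnostic functions do, a generic counting type (a sketch, not the actual project code) could look like this:

      #include <cstddef>
      #include <iostream>
      #include <vector>

      // Illustrative diagnostic type: counts how often each special member
      // function is invoked, e.g. to verify that a refactoring of the data
      // structures does not introduce unintended copies.
      struct Counted {
        static inline std::size_t constructions = 0;
        static inline std::size_t copies = 0;
        static inline std::size_t moves = 0;

        Counted() { ++constructions; }
        Counted(const Counted&) { ++copies; }
        Counted(Counted&&) noexcept { ++moves; }
        Counted& operator=(const Counted&) { ++copies; return *this; }
        Counted& operator=(Counted&&) noexcept { ++moves; return *this; }

        static void report()
        {
          std::cout << "constructions: " << constructions
                    << ", copies: " << copies
                    << ", moves: " << moves << "\n";
        }
      };

      int main()
      {
        std::vector<Counted> v;
        v.reserve(4);           // avoid reallocation noise in the counters
        v.emplace_back();       // 1 construction
        v.push_back(Counted{}); // 1 construction + 1 move
        Counted::report();      // prints: constructions: 2, copies: 0, moves: 1
      }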