Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY
Europe/Zurich

- 10:00 AM → 10:20 AM  Discussion 20m  Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
- 10:20 AM → 10:25 AM  Following up JIRA tickets 5m  Speaker: Ernst Hellbar (CERN)
Low-priority framework issues https://its.cern.ch/jira/browse/O2-5226
- Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead to obtain the pure payload rate: low priority.
- Merged workflow fails if outputs defined after being used as input
  - needs to be implemented by Giulio
- Cannot override options for individual processors in a workflow
  - requires development by Giulio first
- Problem with 2 devices of the same name
- Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- Run getting stuck when too many TFs are in flight.
- Do not use string comparisons to derive the processor type, since DeviceSpec.name is user-defined (see the sketch after this list).
- Support in DPL GUI to send individual START and STOP commands.
- Add an additional check at the DPL level to make sure the firstOrbit received from all detectors is identical when creating the TimeFrame first orbit.
- Implement a proper solution to detect whether a device is firstInChain
- Deploy topology with DPL driver
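On the string-comparison item above: a minimal, hypothetical sketch of the anti-pattern and a more robust alternative. The ExampleSpec type, ProcessorKind enum and isSink helper are made up for illustration and are not the actual DPL API.

```cpp
// Hypothetical illustration: DeviceSpec.name is user-defined, so deciding the
// processor type by matching on the name is fragile. An explicit tag carried
// with the spec is what such a check should rely on instead.
#include <string>

enum class ProcessorKind { Source, Processor, Sink };

struct ExampleSpec {
  std::string name;   // free-form, chosen by the workflow author
  ProcessorKind kind; // explicit, not derived from the name
};

bool isSink(const ExampleSpec& spec)
{
  // Fragile: return spec.name.find("writer") != std::string::npos;
  return spec.kind == ProcessorKind::Sink; // robust: rely on the explicit tag
}
```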
PDP-SRC issues
- Check if we can remove dependencies on /home/epn/odc/files in DPL workflows to remove the dependency on the NFS
  - reading / writing already disabled
  - remaining checks for file existence?
  - check after Pb-Pb by removing files and find remaining dependencies
- logWatcher.sh and logFetcher scripts modified by EPN to remove dependencies on the epnlog user
  - node access privileges fully determined by e-groups
  - new log_access role to allow access in logWatcher mode to retrieve log files, e.g. for on-call shifters - to be validated on STG
  - waiting for EPN for further feedback and modifications of the test setup
- computing estimate for 2024 Pb-Pb
  - originally assumed 305 EPNs sufficient, but needed 340 EPNs (5 % margin) in the end
    - 11 % difference
  - estimate from 2023 Pb-Pb replay data with 2024 software
    - average hadronic interaction rate of Pb-Pb replay timeframes with pile-up correction for the ZDC rate
      - formula: IR_had = -ln(1 - rate_ZDC / (11245 * nbc)) * 11245 * nbc * 7.67 / 214.5 (a numerical check follows after this list)
      - 2023, 544490, nbc=1088, rate_ZNC=1166153.4 Hz: IR_had = 43822.164 Hz
      - 2024, 560161, nbc=1032, rate_ZNC=1278045.2 Hz: IR_had = 48417.767 Hz
      - 10.5 % difference in IR from 2023 replay to 2024 replay
      - 7 % to 47 kHz assumed for the 2023 replay data (?) when estimating the required resources
    - could at least explain part of the difference between the estimated and observed margins
- environment creation
  - cached topologies
    - in practice, only works when selecting only one detector or when defining the Detector list (Global) specifically in the EPN ECS panel
    - when using default, the list of detectors is taken from default variables in ECS
      - not yet clear where this is set, it obviously depends on the selected detectors
      - the order of detectors is always different, even for identical environments, therefore the topology hash is also different and the cached topologies are not used (see the sketch after this list)
    - investigating together with ECS team
  - start-up time
    - ~80 sec spent in state transitions from IDLE to READY
      - will profile state transitions with export DPL_SIGNPOSTS=device to determine if we wait for single slow tasks or if some other part (e.g. DDS) is slow
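As a cross-check of the interaction-rate formula quoted above, a short standalone sketch that plugs in the two replay points from the minutes; the constants (11245 Hz LHC revolution frequency, the 7.67 / 214.5 cross-section ratio) are taken directly from the formula as written.

```cpp
// Quick numerical check of the pile-up-corrected hadronic IR formula above.
#include <cmath>
#include <cstdio>

double irHad(double rateZNC, double nbc)
{
  const double fRev = 11245.0;                                 // LHC revolution frequency [Hz]
  const double mu = -std::log(1.0 - rateZNC / (fRev * nbc));   // Poisson pile-up correction per bunch crossing
  return mu * fRev * nbc * 7.67 / 214.5;                       // scale corrected ZNC rate to hadronic rate
}

int main()
{
  std::printf("2023 replay: %.3f Hz\n", irHad(1166153.4, 1088)); // ~43822 Hz, as in the minutes
  std::printf("2024 replay: %.3f Hz\n", irHad(1278045.2, 1032)); // ~48418 Hz, as in the minutes
}
```

Both values reproduce the quoted 43822.164 Hz and 48417.767 Hz, i.e. roughly a 10.5 % increase from the 2023 to the 2024 replay.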
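On the cached-topology mismatch: a hypothetical illustration (not the actual ODC/ECS implementation) of why a topology hash computed over the detector list in arrival order misses the cache whenever the order changes, and how canonicalising the list (e.g. sorting) before hashing would make it order-independent.

```cpp
// Illustrative sketch only: hashing the detector list as received vs. after
// canonicalisation. Identical detector sets in different order hash differently
// unless the list is sorted (or otherwise normalised) first.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

std::size_t topologyHash(std::vector<std::string> detectors, bool canonicalise)
{
  if (canonicalise) {
    std::sort(detectors.begin(), detectors.end()); // order-independent cache key
  }
  std::string key;
  for (const auto& d : detectors) {
    key += d + ",";
  }
  return std::hash<std::string>{}(key);
}

int main()
{
  std::vector<std::string> a{"ITS", "TPC", "TOF"};
  std::vector<std::string> b{"TPC", "TOF", "ITS"}; // same detectors, different order
  std::printf("as received:   %zu vs %zu\n", topologyHash(a, false), topologyHash(b, false));
  std::printf("canonicalised: %zu vs %zu\n", topologyHash(a, true), topologyHash(b, true));
}
```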
- 10:25 AM → 10:30 AM  TPC ML Clustering 5m  Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
- 10:30 AM → 10:35 AM  ITS Tracking 5m  Speaker: Matteo Concas (CERN)
ITS GPU tracking
- General priorities:
  - Focusing on porting all of what is possible on the device, extending the state of the art, and minimising computing on the host.
  - Optimizations via intelligent scheduling and multi-streaming can happen right after.
  - Kernel-level optimisations to be investigated.
- Move remaining tricky track-finding steps to the GPU
  - ProcessNeighbours kernel has been ported and validated.
  - Now fixing the logic that concatenates its usage multiple times, as it is still faulty.
- TODO:
  - Reproducer for HIP bug on multi-threaded track fitting: no progress yet.
  - Fix possible execution issues and known discrepancies when using gpu-reco-workflow: no progress yet; will start after the tracklet finding is ported.

DCAFitterGPU
- Deterministic approach using SMatrixGPU on the host, under a particular configuration: no progress yet.
- 10:35 AM → 10:45 AM  TPC Track Model Decoding on GPU 10m  Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
- 10:45 AM → 10:55 AM  Efficient Data Structures 10m  Speaker: Dr Oliver Gregor Rietmann (CERN)