ALICE Weekly Meeting: Software for Hardware Accelerators / PDP-SRC - MINUTES ONLY
Europe/Zurich

10:00 → 10:20   Discussion (20m)
Speakers: David Rohr (CERN), Giulio Eulisse (CERN)
10:20 → 10:25   Following up JIRA tickets (5m)
Speaker: Ernst Hellbar (CERN)
Low-priority framework issues https://its.cern.ch/jira/browse/O2-5226
- Grafana metrics: might want to introduce additional rate metrics that subtract the header overhead to obtain the pure payload rate; low priority.
- Merged workflow fails if outputs defined after being used as input
- needs to be implemented by Giulio
- Cannot override options for individual processors in a workflow
- requires development by Giulio first
- Problem with 2 devices of the same name
- Usage of valgrind in external terminal: the test case currently causes a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- Run getting stuck when too many TFs are in flight.
- Do not use string comparisons to derive the processor type, since DeviceSpec.name is user-defined.
- Support in DPL GUI to send individual START and STOP commands.
- Add an additional check at DPL level to make sure the firstOrbit received from all detectors is identical when creating the TimeFrame first orbit (see the sketch after this list).
- Implement a proper solution to detect whether a device is firstInChain
- Deploy topology with DPL driver
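A minimal sketch of what such a firstOrbit consistency check could look like (Python illustration only; the function name and error handling are assumptions, not the actual DPL code):

```python
# Sketch only (not the actual DPL implementation): check that the firstOrbit
# reported by every detector agrees before using it as the TimeFrame first orbit.
def timeframe_first_orbit(first_orbits: dict[str, int]) -> int:
    """first_orbits maps detector name -> firstOrbit received from that detector."""
    unique = set(first_orbits.values())
    if len(unique) != 1:
        raise ValueError(f"inconsistent firstOrbit across detectors: {first_orbits}")
    return unique.pop()

print(timeframe_first_orbit({"TPC": 123456, "ITS": 123456, "FT0": 123456}))
```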
PDP-SRC issues
- Check if we can remove dependencies on /home/epn/odc/files in DPL workflows to remove the dependency on the NFS
  - reading / writing already disabled
  - remaining checks for file existence?
  - check after Pb-Pb by removing the files and finding remaining dependencies
- logWatcher.sh and logFetcher scripts modified by EPN to remove dependencies on the epnlog user
  - node access privileges fully determined by e-groups
  - new log_access role to allow access in logWatcher mode to retrieve log files, e.g. for on-call shifters
    - to be validated on STG
  - waiting for EPN for further feedback and modifications of the test setup
- computing estimate for 2024 Pb-Pb
  - originally assumed 305 EPNs sufficient, but needed 340 EPNs (5 % margin) in the end
    - 11 % difference
  - estimate from 2023 Pb-Pb replay data with 2024 software
    - average hadronic interaction rate of Pb-Pb replay timeframes with pile-up correction for the ZDC rate
    - formula: IR_had = -ln(1 - rate_ZDC / (11245 * nbc)) * 11245 * nbc * 7.67 / 214.5 (see the calculation sketch after this list)
    - 2023, 544490, nbc = 1088, rate_ZNC = 1166153.4 Hz: IR_had = 43822.164 Hz
    - 2024, 560161, nbc = 1032, rate_ZNC = 1278045.2 Hz: IR_had = 48417.767 Hz
    - 10.5 % difference in IR from the 2023 replay to the 2024 replay
    - 7 % difference with respect to the 47 kHz assumed for the 2023 replay data (?) when estimating the required resources
  - could at least explain part of the difference between the estimated and observed margins
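For reference, a minimal Python sketch (not part of the original notes) that re-evaluates the IR formula above for the two replay datasets, using only the constants quoted in the minutes:

```python
# Hedged sketch: re-evaluating the pile-up-corrected hadronic interaction rate
# formula quoted above for the two replay datasets.
import math

def ir_had(rate_zdc_hz: float, nbc: int) -> float:
    """Hadronic IR from the ZDC trigger rate with pile-up correction.

    11245 Hz is the LHC revolution frequency, nbc the number of colliding
    bunches; the factor 7.67 / 214.5 is taken verbatim from the formula above.
    """
    mu = -math.log(1.0 - rate_zdc_hz / (11245.0 * nbc))  # average collisions per bunch crossing
    return mu * 11245.0 * nbc * 7.67 / 214.5

print(ir_had(1166153.4, 1088))  # 2023 replay -> ~43.8 kHz
print(ir_had(1278045.2, 1032))  # 2024 replay -> ~48.4 kHz
```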
- environment creation
  - https://its.cern.ch/jira/browse/O2-5629
  - cached topologies
    - in practice, only works when selecting only one detector or when defining the Detector list (Global) explicitly in the EPN ECS panel
      - when using default, the list of detectors is taken from default variables in ECS
        - not yet clear where this is set; it obviously depends on the selected detectors
      - the order of detectors is always different, even for identical environments; therefore the topology hash is also different and the cached topologies are not used (see the small illustration below)
    - investigating together with the ECS team
      - fix in Controls and ECS to provide an alphabetically ordered detector list
      - topology hashes are now identical for identical environments
      - speed-up (in STG): from 22 s the first time the topology generation scripts are run down to 5 s the second time, using the cached topology
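A small illustration (Python, not the actual ECS/ODC code) of why an unordered detector list defeats the topology cache and why an alphabetically ordered list fixes it; the hash function here is a hypothetical stand-in for the real topology hash:

```python
# Illustration only: an order-dependent hash over the detector list means two
# identical environments can produce different topology hashes -> cache miss.
import hashlib

def topology_hash(detectors: list[str]) -> str:
    # Hypothetical stand-in for the real hash over the generated topology.
    return hashlib.sha256(",".join(detectors).encode()).hexdigest()[:12]

env_a = ["TPC", "ITS", "FT0"]
env_b = ["ITS", "FT0", "TPC"]  # same detectors, different order

print(topology_hash(env_a) == topology_hash(env_b))                  # False -> cache miss
print(topology_hash(sorted(env_a)) == topology_hash(sorted(env_b)))  # True  -> cache hit
```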
  - start-up time
    - ~80 s spent in state transitions from IDLE to READY
      - will profile state transitions with export DPL_SIGNPOSTS=device to determine if we wait for single slow tasks or if some other part (e.g. DDS) is slow
    - Summary of time spent in state transitions
      - Text file with summary information: /home/ehellbar/env_creation_profiling/profile_transitions_2rYE2tBcysz/2rYE2tBcysz/transition_times_sorted.txt
      - Starting FairMQ state machine to IDLE
        - total of 35 s
        - devices are started one by one, so timestamps increase device by device
        - time between the last start of a device and the first initialization of a device is 15 s
        - so DDS spends 20 s sending the start-up command to all the tasks (see the arithmetic sketch after this list)
      - IDLE to INITIALIZED
        - 25 s for the GPU RTC, all tasks waiting for it to finish
        - in the shadow of the GPU RTC, the QC tasks themselves take up to 15 s in the Init callback to initialize the CcdbApi
      - DEVICE READY to READY
        - total of 10 s (number obtained from InfoLogger messages)
        - 6 - 9 s for shm mapping in gpu-reconstruction
    - DDS waits for all tasks to complete a transition at the following steps (excluding steps where effectively 0 time is spent)
      - until IDLE
      - until INITIALIZED
      - until READY
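As a sanity check, a tiny sketch (using only the numbers quoted above) of how the 20 s attributed to DDS follows from the 35 s total transition time and the 15 s overlap:

```python
# Sketch only, with the figures from the minutes: the DDS dispatch time is the
# total "Starting FairMQ state machine -> IDLE" duration minus the 15 s between
# the last device start and the first device initialization.
total_start_to_idle_s = 35.0       # total duration of the transition
last_start_to_first_init_s = 15.0  # interval in which no further start commands are sent

dds_dispatch_s = total_start_to_idle_s - last_start_to_first_init_s
print(f"DDS spends ~{dds_dispatch_s:.0f} s sending start-up commands to all tasks")
```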
10:25 → 10:30   TPC ML Clustering (5m)
Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Items on agenda:
- Started writing the thesis: https://www.overleaf.com/read/hcvqgpxnjqnz#2cd750
- Simulations for different IRs done, always 50 events each. Using these to evaluate NN performance for different IRs:
- PbPb: 10, 30, 50 kHz
- pp: 100, 500, 1000, 1500 kHz
- Current developments:
  - dE/dx tuning:
    - Obtained the workflow from Jens. The calibration object is being produced and loaded, but no effect on tpcptracks.root observed yet. Investigating.
  - Lambda / K0S reconstruction efficiency:
    - Obtained the workflow from Ruben
    - Asked Sandro if it is possible to inject K0S and Lambda into the FST simulation (basically add the digits on top, i.e. merge the two simulations): technically possible, only needs minor development in the simulation framework
Study of NN input size:
- Used the currently existing training data to extract different "input grids" (row x pad x time): (1x5x5), (1x7x7), (3x7x7), (5x7x7) and the current reference case (7x7x7); see the windowing sketch after this list
- Used 7x7x7 for classification and only compared the effect of changing the input to the regression network
- Observations:
  - Smooth behaviour for the 2D cases (i.e. 1x5x5 and 1x7x7) at sector boundaries, as expected
  - No reliable momentum vector estimate for 2D, as expected (since 3D charge information is needed)
  - 3x7x7 vs 7x7x7: the Phi estimate (= width of the Phi distribution) worsens by around 10% across pT, the Theta estimate by around 20%. But both distributions are well centered across pT in both cases.
  - qTot and dE/dx performance is basically as good for 2D as for 3D input. The sigma estimation actually slightly improves for the 2D case.
  - 1x5x5 vs 1x7x7 (and also 3D): the width of the CoG-time estimate at inner radii improves with a larger pad-time window (left 1x5x5, right 5x7x7). The CoG-pad estimate stays very similar.
  - Tracking efficiency at very low pT improves for 3D input, but is overall almost identical between the 2D and 3D cases
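A minimal windowing sketch (illustration only, not the actual training-data pipeline; array shapes and names are assumptions) of how a (row x pad x time) input grid can be sliced around a cluster-center candidate:

```python
# Illustration only: slicing a (row x pad x time) window around a cluster-center
# candidate from a dense charge map. Shapes, names and the zero-padding strategy
# are assumptions, not the actual O2 / training code.
import numpy as np

def input_grid(charges: np.ndarray, row: int, pad: int, time: int,
               shape: tuple[int, int, int] = (7, 7, 7)) -> np.ndarray:
    """Return a shape-sized window centred on (row, pad, time).

    charges: dense 3D array indexed as [row, pad, time].
    Out-of-range bins (e.g. at sector boundaries) are zero-padded.
    """
    half = [s // 2 for s in shape]
    padded = np.pad(charges, [(h, h) for h in half])  # zero-pad the borders
    r, p, t = row + half[0], pad + half[1], time + half[2]
    return padded[r - half[0]: r + half[0] + 1,
                  p - half[1]: p + half[1] + 1,
                  t - half[2]: t + half[2] + 1]

# Toy example: a 1x5x5 (2D) and a 7x7x7 (3D) window around the same candidate.
charges = np.random.default_rng(0).random((20, 20, 40)).astype(np.float32)
print(input_grid(charges, 10, 10, 20, (1, 5, 5)).shape)  # (1, 5, 5)
print(input_grid(charges, 10, 10, 20, (7, 7, 7)).shape)  # (7, 7, 7)
```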
Next steps:
- Small things: update the macro with axis-label plotting and add a comparison macro for all plots (to plot TGraphs from two QA outputs into one plot)
- Giulio: Get the alidist recipe working fully
- Repeat studies on all interaction rates
- Above mentioned: dE/dx calibration + Lambda-K0S analysis
10:30 → 10:35   ITS Tracking (5m)
Speaker: Matteo Concas (CERN)
ITS GPU tracking
- General priorities:
  - Focusing on porting all of what is possible to the device, extending the state of the art, and minimising computing on the host.
  - Optimisations via intelligent scheduling and multi-streaming can happen right after.
  - Kernel-level optimisations to be investigated.
- Move the remaining tricky track-finding steps to GPU
  - The ProcessNeighbours kernel has been ported and validated.
  - Now fixing the logic that concatenates its usage multiple times, as it is still faulty -> still WIP
- TODO:
  - Reproducer for the HIP bug in multi-threaded track fitting: no progress yet.
  - Fix possible execution issues and known discrepancies when using gpu-reco-workflow: no progress yet; will start after the tracklet finding is ported.

DCAFitterGPU

- Deterministic approach using SMatrixGPU on the host, under a particular configuration: no progress yet.
10:35 → 10:45   TPC Track Model Decoding on GPU (10m)
Speaker: Gabriele Cimador (Universita e INFN Trieste (IT))
10:45 → 10:55   Efficient Data Structures (10m)
Speaker: Dr Oliver Gregor Rietmann (CERN)