Alice Weekly Meeting: Software for Hardware Accelerators / PDP Run Coordination / Full System Test
11:00 → 11:20
PDP Run Coordination (20m)
Speakers: David Rohr (CERN), Ole Schmidt (CERN)
Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
Reducing overhead from headers:
- Matthias is working on this: https://alice.its.cern.ch/jira/browse/O2-2395.
Issue with start / stop / start cycle.
- DPL processes die if ECS does a start / stop / start cycle. Currently this means a new partition must be created for every run: https://alice.its.cern.ch/jira/browse/O2-2559
- One fix in monitoring was not enough; Matthias will investigate further once he is done with the headers.
Proper EOS handling in workflows (current plan):
- Summarized what we want to do in this JIRA: https://alice.its.cern.ch/jira/browse/O2-2715?filter=-2
- Giulio will work on it once the more important issues are solved.
Problem with EPN OS and GPUs:
- AMD GPUs currently not working on CS8; under investigation. For the moment the EPNs must stay on CC8.
- ROCm 4.5 fails to register memory when we switch to CentOS 8.5 (8.5 is the final CC8 version, since CC8 is now EOL; if we want to stick with CC8 on the EPNs for some time, it might make sense to install this latest version now).
- Inquired with AMD about future support of RHEL clones, waiting for reply.
New defaults for EPN:
- EPNs should no longer use the dataflow defaults, since these cause OpenSSL problems with AliEn access.
- We want to use -O3 / -march for the EPN builds, thus we have created `defaults-o2-epn.sh`, which is a clone of `defaults-o2.sh` with different CFLAGS. To be tested (on my todo-list); then we will switch the nightly EPN builds to these defaults.
Issues currently lacking manpower, waiting for a volunteer:
- For debugging, it would be convenient to have a proper tool that (using the FairMQ debug mode) can list all messages currently in the SHM segments, similar to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108 (see the sketch after this list).
- Redo / improve the parameter range scan for tuning GPU parameters. In particular on the AMD GPUs, which seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
- Need someone (ideally from EPN) to check if the server crashes that were caused by the GPU monitoring are gone with ROCm 4.5.
- With ROCm 4.3, some GPUs don't have their clocks set to high, even though this is requested via the rocm-smi tool. Under investigation; we need to check whether that is still the case with 4.5.
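As a starting point for whoever volunteers for the SHM-listing tool: FairMQ's SHM transport builds on Boost.Interprocess managed segments, so a minimal sketch can simply enumerate the named objects of a segment (the segment name is taken as an argument; the actual FairMQ bookkeeping and debug-mode message metadata are not reproduced here):

    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <cstdio>

    // Sketch: enumerate the named objects of a Boost.Interprocess managed
    // shared-memory segment. FairMQ's SHM transport builds on such segments,
    // but the exact segment naming and message layout are not modeled here.
    int main(int argc, char** argv)
    {
      if (argc < 2) {
        std::fprintf(stderr, "usage: %s <segment-name>\n", argv[0]);
        return 1;
      }
      using namespace boost::interprocess;
      managed_shared_memory segment(open_read_only, argv[1]);
      std::printf("size=%zu free=%zu named-objects=%zu\n", segment.get_size(),
                  segment.get_free_memory(), segment.get_num_named_objects());
      for (auto it = segment.named_begin(); it != segment.named_end(); ++it) {
        std::printf("  %.*s\n", static_cast<int>(it->name_length()), it->name());
      }
      return 0;
    }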
InfoLogger messages:
- Still need to switch default threshold from warning to important.
- All detector errors that do not require action by the operators should be demoted from error to alarm.
- Important warnings that we want to see in the InfoLogger should be promoted to alarm.
- Severity limit for infologger will be raised to state - no more warnings shown.
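For reference, the demotions / promotions are one-line changes at the message sites, assuming the "alarm" severity available in the fairlogger version used by O2; the messages below are made up:

    #include <fairlogger/Logger.h>

    // Hypothetical message sites illustrating the planned severity changes,
    // assuming fairlogger's "alarm" severity (message texts are invented).
    void reportConditions()
    {
      // Demotion: a detector error that needs no operator action becomes an alarm.
      // Was: LOG(error) << ...;
      LOG(alarm) << "EMCAL: raw page with unexpected trailer, skipping";

      // Promotion: an important warning we still want to see in the InfoLogger.
      // Was: LOG(warning) << ...;
      LOG(alarm) << "TPC: incomplete time frame received";
    }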
Time frame throttling:
- Initial version of time frame throttling merged (https://alice.its.cern.ch/jira/browse/O2-2589)
- Not working together with QC (thus unusable for sync reco).
- Throttling mechanism already extracted to dedicated class, to be used in different places.
- In order to use it for cases other than the raw-proxy, the ad-hoc setup of the out-of-band FMQ channel must be replaced by a generic out-of-band channel feature in DPL.
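The extracted class is presumably a small in-flight counter along these lines; a minimal sketch of the idea (class and method names are assumptions, not the actual O2 code):

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    // Sketch of a time-frame throttler: the producer may only have
    // `maxInFlight` TFs in the system at once; the end of the chain reports
    // completions (in the real setup via an out-of-band FairMQ channel).
    class TimeFrameThrottler
    {
     public:
      explicit TimeFrameThrottler(uint32_t maxInFlight) : mMaxInFlight(maxInFlight) {}

      // Called by the producer before injecting a new TF; blocks at the limit.
      void acquire()
      {
        std::unique_lock lk(mMutex);
        mCv.wait(lk, [this] { return mInFlight < mMaxInFlight; });
        ++mInFlight;
      }

      // Called when a TF has been fully processed downstream.
      void release()
      {
        {
          std::lock_guard lk(mMutex);
          if (mInFlight > 0) {
            --mInFlight;
          }
        }
        mCv.notify_one();
      }

     private:
      std::mutex mMutex;
      std::condition_variable mCv;
      uint32_t mMaxInFlight;
      uint32_t mInFlight = 0;
    };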
Speeding up startup of workflow on the EPN:
- Now registering only the SHM segments that are actually used on the GPUs: https://alice.its.cern.ch/jira/browse/O2-2735
- Reduces GPU memory registration time to 70% of what it was before.
- SHM management tool: work in progress.
- Parallel registration on multiple GPUs: with the latest improvements, GPU registration needs ~5 seconds per GPU. Currently it is serialized, i.e. 40 seconds for 8 GPUs, which is way too much for the start of run. With ROCm 4.5 it should in principle be possible to register on multiple GPUs in parallel (a sketch follows after this list); when we tried this last time it was crashing, so we need to retry now.
- Initial registration on the GPU: with ROCm 4.5 the large slowdown for the initial registration on the first GPU is gone. This means the GPU registration no longer needs to be done in the SHM management tool. We still need the management tool for the host allocation and mlocking (since those also take too long).
- New feature: we should not create the DPL workflow JSON for each individual process (currently this means we start O(1000) processes at each start of run just to create the JSONs). Discussing with the DDS team how to create it once and cache it: https://alice.its.cern.ch/jira/browse/R3C-696
- Region callbacks for the unmanaged region are not executed in parallel, but arrive at one process after another: https://alice.its.cern.ch/jira/browse/O2-2737
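For the parallel-registration retry mentioned above, a minimal sketch of what we would test, assuming the per-device hipSetDevice / hipHostRegister pattern of the serialized version (error handling elided; whether concurrent registration of the same buffer is legal is exactly what has to be verified with ROCm 4.5):

    #include <hip/hip_runtime.h>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Sketch of the parallel-registration experiment: one thread per GPU,
    // each registering the same host SHM segment. This crashed with earlier
    // ROCm versions and needs to be retested with 4.5.
    void registerOnAllGpus(void* ptr, size_t size, int nDevices)
    {
      std::vector<std::thread> workers;
      workers.reserve(nDevices);
      for (int dev = 0; dev < nDevices; ++dev) {
        workers.emplace_back([=] {
          hipSetDevice(dev);                                  // bind this thread to GPU `dev`
          hipHostRegister(ptr, size, hipHostRegisterDefault); // pin + map the segment
        });
      }
      for (auto& w : workers) {
        w.join();
      }
    }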
EPN Scheduler
- Switched from HTCondor to Slurm (better documentation and GPU support).
- Slurm is set up with 3 dev nodes (old HLT nodes), currently 19 EPNs in the async partition and 3 EPNs in the calib partition (not yet clear whether Slurm will eventually be used for the calibration, but from the Slurm side everything is ready).
- Presentation with some information on Slurm attached.
- Will present this to the detectors at the next weekly meeting, and then cancel the free EPN access.
Important framework features to follow up:
- Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Giulio will follow this up, but currently some other things still have higher priority.
- Bug: the ROOT writer stores TFs in out-of-sync entries of the ROOT file when DPL pipelines are used.
- Fix via a completion policy in https://github.com/AliceO2Group/AliceO2/pull/7536, but it needs additional framework support (see the sketch after this list).
- Suppress default options when generating DDS command lines: https://alice.its.cern.ch/jira/browse/O2-2736
- Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
- Backpressure reporting when there is only one input channel: no progress.
- (Not for this week) multi-threaded pipeline: no progress.
- (Not for this week) Problem with forwarding of multi-spec output: the MID entropy encoder receives the TF twice: no progress.
- Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
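For reference, wiring such a policy into a workflow uses the standard DPL customize hook, defined before runDataProcessing.h is included; a sketch assuming the ordered-completion helper from the PR above (the helper name / signature and the device-name pattern are assumptions):

    #include "Framework/CompletionPolicyHelpers.h"
    #include <vector>

    // Standard DPL customization hook. The ordered-completion helper and the
    // device-name regex below are assumptions based on the PR referenced above.
    void customize(std::vector<o2::framework::CompletionPolicy>& policies)
    {
      policies.push_back(
        o2::framework::CompletionPolicyHelpers::consumeWhenAllOrdered("root-writer.*"));
    }

    #include "Framework/runDataProcessing.h"

    // A real workflow definition would go here; left empty in this sketch.
    o2::framework::WorkflowSpec defineDataProcessing(o2::framework::ConfigContext const&)
    {
      return {};
    }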
AMD / GPU stability issues in FST:
- ROCm 4.5 released. Currently being tested, some issues are fixed, but for now we cannot use it:
- Random crashes when reproducing cosmics data: cause not yet clear.
- Internal compiler error with LOG(...) macros still exists, preparing minimal reproducer. Lubos might take a look.
- Memory registration error with ROCm 4.5 fixed. Can be used in production now.
- Open issues not yet tested with 4.5:
- 2 types of server crashes with kernel crash logs hinting at the amdgpu kernel module. Rare, not easily reproducible. Not clear whether the 2 are the same problem. AMD confirmed that at least 1 case is a problem they also see in their lab.
- Server crash with a kernel NULL-pointer dereference caused by GPU monitoring via rocm-smi.
GPU Performance issues in FST
- One infrequent performance issue remains, single iterations on AMD GPUs can take significantly longer, have seen up to 26 seconds instead of 18 seconds. Under investigation, but no large global effect.
- From the memory benchmarks we did, it seems we will not be able to fully recover the 10% performance we lose when going beyond 16 GB; at least there is no easy recipe.
Minor open points:
- https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
- https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
- https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
- https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
- https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
- https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
- Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
Detector status:
- EMCAL errors (no crash, just messages, EMCAL is working on it).
11:20 → 11:40
Software for Hardware Accelerators (20m)
Speakers: David Rohr (CERN), Ole Schmidt (CERN)
Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
Report on GPU memory micro benchmark progress
- Matteo has assembled all results in a presentation (attached here for reference), which we have also shared with AMD.
- The follow-up step is to repeat / improve the parameter range scan.
ITS GPU Tracking and Vertexing:
- Matteo will continue work on the tracking after the memory micro benchmarks are implemented.
TRD Tracking
- Working on a strict matching mode (filter out ambiguous matches to obtain a very clean track sample; see the sketch after this list).
- David and Ole need to sit together to commission the TPC-TRD tracking on GPUs (as soon as possible...)
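A minimal sketch of the strict-matching idea: keep a TPC-TRD match candidate only if neither of its two tracks appears in any other candidate (the Match struct is illustrative, not the O2 data model):

    #include <unordered_map>
    #include <vector>

    // Illustrative match candidate between one TPC track and one TRD track.
    struct Match {
      int tpcTrackId;
      int trdTrackId;
    };

    // Keep only matches that are unique on both sides, discarding every
    // ambiguous combination to obtain a very clean track sample.
    std::vector<Match> strictFilter(const std::vector<Match>& candidates)
    {
      std::unordered_map<int, int> tpcCount, trdCount;
      for (const auto& m : candidates) {
        ++tpcCount[m.tpcTrackId];
        ++trdCount[m.trdTrackId];
      }
      std::vector<Match> clean;
      for (const auto& m : candidates) {
        if (tpcCount[m.tpcTrackId] == 1 && trdCount[m.trdTrackId] == 1) {
          clean.push_back(m); // unambiguous on both sides
        }
      }
      return clean;
    }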
ANS Encoding
- Michael is implementing the proposed improvements in C++. It takes a bit longer than expected due to some refactoring that is needed.
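For orientation, the core of a byte-wise rANS coder is compact; below is a textbook sketch of the encoder step (generic rANS, not Michael's actual O2 implementation):

    #include <cstdint>
    #include <vector>

    // Textbook single-stream rANS encoder core. Symbol frequencies are
    // normalized so that they sum to 1 << kScaleBits.
    constexpr uint32_t kScaleBits = 14;   // total normalized frequency: 16384
    constexpr uint32_t kRansL = 1u << 23; // lower bound of the state interval

    // Encode one symbol with frequency `freq` and cumulative frequency `cum`,
    // spilling renormalization bytes into `out`.
    void encodePut(uint32_t& state, std::vector<uint8_t>& out, uint32_t freq, uint32_t cum)
    {
      // Renormalize: emit low bytes until the updated state stays in range.
      const uint32_t xMax = ((kRansL >> kScaleBits) << 8) * freq;
      while (state >= xMax) {
        out.push_back(static_cast<uint8_t>(state & 0xffu));
        state >>= 8;
      }
      // C(s, x) = floor(x / f) * 2^n + (x mod f) + cum
      state = (state / freq) * (1u << kScaleBits) + (state % freq) + cum;
    }

    // Flush the final 32-bit state; the decoder reads the stream back to front.
    void encodeFlush(uint32_t state, std::vector<uint8_t>& out)
    {
      for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<uint8_t>(state & 0xffu));
        state >>= 8;
      }
    }

Encoding starts from state = kRansL and processes the symbols in reverse order, so that the decoder can consume them front to back.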