Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Name: Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
Start: 2022-02-16T11:00:00+01:00
End: 2022-02-16T12:20:00+01:00
Location: No location set

Wednesday 16 Feb 2022, 11:00 → 12:20 Europe/Zurich

- 11:00 → 11:20
  PDP System Run Coordination 20m
  
  Speakers: David Rohr (CERN), Ole Schmidt (CERN)
  Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
  
  Reducing overhead from headers:
  
  New DD version merged, can now switch between old and new header format via env variable (working well).
  
  New format still crashes. Reported to Matthias, who already has 2 PRs open with fixes.
  
  A test last night showed that these fixes cure the original crash I reported, but it now crashes somewhere else. Needs further investigation, but good progress.
  
  Matthias can now also run CPU FST on EPN dev nodes and reproduce the issue.
  
  Issue with start / stop / start cycle.
  
  DPL processes die if ECS does a cycle start / stop / start. Currently this means a new partition must be created for every run: https://alice.its.cern.ch/jira/browse/O2-2559
  
  Something with stfsender on the FLPs was fixed and the ticket closed, but if I understand correctly, the problem that the proxy is not forwarding data after START->STOP->START persists?
  
  Proper EOS handling in workflows (current plan):
  
  Tested PR by Giulio, but still several issues:
  
  Not working for all devices, e.g. works for tpc tracking / entropy encoding, but not for readout-proxy. Not clear why.
  
  Timeout for stop transition works, but EoS does not stop the timeout.
  
  If an EoS is received to end the chain, it starts a timer to wait for another EoS.
  
  Giulio is already aware and working on fixes (some PRs open already)
  
  Giulio is improving the GUI to allow for better testing by sending state transitions manually. But I would consider this low priority, since I started to just hack the code and put in there manually what I want.
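
  The intended interplay of the stop timeout and EoS described above can be sketched as a small model. This is a hypothetical illustration in Python, not the DPL implementation: the class name, its methods, and the assumption that a device knows how many EoS markers to expect are all invented for the example. It shows the behavior one would want: the stop transition arms the timer once, and the final EoS cancels it instead of starting a new wait.

```python
import time


class StopTransitionGuard:
    """Hypothetical model of a device waiting for EoS after a stop transition."""

    def __init__(self, expected_eos, timeout_s, clock=time.monotonic):
        self.expected_eos = expected_eos  # assumed to be known per device
        self.timeout_s = timeout_s
        self.clock = clock                # injectable clock, eases testing
        self.received = 0
        self.deadline = None              # None means no timer is running

    def on_stop(self):
        # The stop transition arms the timeout exactly once.
        if self.deadline is None:
            self.deadline = self.clock() + self.timeout_s

    def on_eos(self):
        self.received += 1
        if self.received >= self.expected_eos:
            # All EoS seen: cancel the timer, do not wait for another EoS.
            self.deadline = None

    def should_exit(self):
        if self.received >= self.expected_eos:
            return True
        return self.deadline is not None and self.clock() >= self.deadline
```

  With such a model, a device exits either when all expected EoS have arrived or when the stop timeout expires, whichever comes first.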
  
  GPU ROCm Issues to be followed up by EPN:
  
  EPN team will set up a test script for future validation of ROCm releases
  
  Check if ROCm 4.5 fixed the server crashes we had when enabling GPU monitoring
  
  Check if ROCm 4.5 fixes the issue that GPU clock speeds are sometimes set too low
  
  Create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
  
  Problem with EPN OS and GPUs:
  
  AMD GPUs currently not working on CS8, investigating, for the moment the EPN must stay at CC8.
  
  ROCm 4.5 fails to register memory when we switch to CentOS 8.5 (8.5 is the final CC8 version, since CC8 is now EOL. If we want to stick to CC8 on the EPNs for some time, perhaps it would make sense to install this latest version now).
  
  Inquired with AMD about future support of RHEL clones, waiting for reply.
  
  New ROCm version 5.0 released. According to the release notes it has full CC 8.5 support now.
  
  I have asked AMD whether that also fixes the issue I have reported, but no answer yet.
  
  Asked EPN folks to install 2 test nodes, one with CC8.4 and one with CC8.5, so we can validate it.
  
  New defaults for EPN:
  
  Deployed and running on the EPNs.
  
  Unfortunately, cannot run on old HLT nodes since they have no AVX2. Deploying normal O2 build on dev nodes.
  
  Still waiting for EPN to adjust deployment script, so we can have different builds on prod and on dev nodes (currently must still be done manually on the dev nodes).
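
  Choosing a build per node, as discussed above, needs a check for AVX2 support. The following is a minimal sketch in Python, assuming a Linux x86 node where the CPU feature list appears on the `flags` line of `/proc/cpuinfo`. The function names are hypothetical and this is not part of the EPN deployment script.

```python
def cpu_flags(cpuinfo_text):
    """Return the CPU feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no flags line found (e.g. non-x86 platform)


def supports_avx2(cpuinfo_text):
    """True if the cpuinfo text lists the avx2 feature flag."""
    return "avx2" in cpu_flags(cpuinfo_text)
```

  On a node, one would pass the contents of `/proc/cpuinfo` to `supports_avx2` and fall back to the normal O2 build when it returns False.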
  
  Full system test issues:
  
  Sylvain had a look at the raw file data set for the full scale test, looks mostly good.
  
  Some minor issues fixed (detector name capitalization, FIT FLP numbers).
  
  Some detectors don't create output for all equipment IDs. Ruben is checking with them if these equipments are simply not used / not simulated.
  
  One problem with HMPID links.
  
  Already most issues fixed, provided a new dataset yesterday, and will create yet another one when all points are fixed.
  
  Sylvain is preparing a readout config, checking how to run the test with AliECS.
  
  Issues currently lacking manpower, waiting for a volunteer:
  
  For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
  
  Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
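
  The separation of training and test data sets for the parameter scan could look roughly like the sketch below. It is a plain grid scan in Python under stated assumptions: `measure` is a hypothetical callable that runs one time frame with a given parameter set and returns the processing time, and the parameter names are placeholders. It is not the existing tuning script.

```python
from itertools import product


def scan_parameters(param_grid, measure, train_tfs, test_tfs):
    """Pick the fastest parameter set on train_tfs, then time it on test_tfs.

    param_grid: dict mapping parameter name to a list of candidate values.
    measure:    hypothetical callable (params, tf) -> processing time in seconds.
    Returns (best_params, mean_train_time, mean_test_time).
    """
    names = sorted(param_grid)
    best_params, best_time = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        mean_time = sum(measure(params, tf) for tf in train_tfs) / len(train_tfs)
        if mean_time < best_time:
            best_params, best_time = params, mean_time
    # Evaluate only the selected parameter set on time frames not used for tuning.
    test_time = sum(measure(best_params, tf) for tf in test_tfs) / len(test_tfs)
    return best_params, best_time, test_time
```

  Reporting the time on the held-out time frames guards against parameters that only fit the particular time frames used for tuning.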
  
  Time frame throttling:
  
  Not working together with QC (thus unusable for sync reco).
  
  Throttling mechanism already extracted to dedicated class, to be used in different places.
  
  In order to use it for cases other than the raw-proxy, the ad-hoc setup of the out-of-band FMQ channel must be replaced by a generic out-of-band channel DPL feature.
  
  Speeding up startup of workflow on the EPN:
  
  All done from the PDP side, startup time in standalone measurement: 18 seconds.
  
  Next steps:
  
  Get R3C-646 and R3C-696 closed.
  
  Deploy the tool on the EPN.
  
  Add automatic cleanup in the EPN state machine.
  
  Wait for new DDS version to support DPL config integrated in DDS XML topology file.
  
  Remeasure in a real partition controlled by AliECS, compare the time, and see where we have delays and lose time since things don't run in parallel.
  
  New feature: we must not create the DPL workflow JSON for each individual process (this means we start O(1000) processes at each start of run just to create the JSONs). Anar is implementing the required config file feature in DDS: https://github.com/FairRootGroup/DDS/issues/406
  
  EPN Scheduler
  
  Next step: Add VObox for GRID
  
  Found one more issue: slurm interactive sessions block the full node (instead of only the requested number of CPU cores). Need to check slurm setup and improve that, since we have only 3 dev nodes.
  
  Important framework features to follow up:
  
  Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Giulio will follow this up, but currently some other things still have higher priority.
  
  Bug: ROOT writer stores TFs in out-of-sync entries in the ROOT file if DPL pipelines are used.
  
  Fix via completion policy in https://github.com/AliceO2Group/AliceO2/pull/7536, but needs additional framework support.
  
  Wildcard matching with exceptions for input specs
  
  Suppress default options when generating DDS command lines: https://alice.its.cern.ch/jira/browse/O2-2736 - will drop this, not needed once we cache the DPL JSON topology.
  
  Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
  
  Backpressure reporting when there is only 1 input channel: no progress.
  
  (Not for this week) multi-threaded pipeline: no progress.
  
  (Not for this week) Problem with forwarding of multi-spec output: mid-entropy encoder receives the TF twice: no progress.
  
  Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
  
  Output proxy spinning at 100%: no progress.
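
  For the Grafana rate metric listed above, the pure payload rate is the total rate minus the header overhead. A minimal sketch of that arithmetic in Python follows. The function and parameter names are hypothetical, and a fixed header size per message is an assumption made for the example.

```python
def payload_rate(total_bytes_per_s, messages_per_s, header_bytes_per_message):
    """Return the payload rate in bytes/s after subtracting header overhead.

    Assumes every message carries a header of the same size.
    """
    overhead = messages_per_s * header_bytes_per_message
    if overhead > total_bytes_per_s:
        raise ValueError("header overhead exceeds the total rate")
    return total_bytes_per_s - overhead
```

  For example, a total rate of 1,000,000 bytes/s made of 1000 messages/s with an assumed 80 header bytes each leaves 920,000 bytes/s of payload.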
  
  EPN O2 Software deployment:
  
  This week deployed 2 O2 versions in parallel on Monday for milestone week: 1 for the MW, the other for DataDistribution tests (with new DD version) for incomplete TF building.
  
  New deployment scheme worked very well, with software selection in AliECS GUI. Will simplify the deployment in the future.
  
  Another update on Tuesday with a PR needed for ITS calibration tests.
  
  AMD / GPU stability issues in FST:
  
  New compiler bug in AMD hipcc encountered: recent changes in our code (unrelated to TPC), together with a high optimization level (-O3) and function calls enabled (recommended by AMD for our code), lead to miscompilation of the TPC looper following code.
  
  EPN has provided a test node with a reproducer, granted access to AMD today.
  
  In the meantime, as a workaround the optimization level is reduced to -O2, at a moderate performance penalty.
  
  ROCm 4.5 released. Result of tests of open issues:
  
  Random server crashes with error reported in amdgpu kernel module in dmesg: Not fully fixed, had at least one crash now, but with different dmesg message compared to before. Not clear if same issue or something different (or a hardware error, happened only once, but have only one test node so far).
  
  Random crash with noisy pp data: Disappeared with ROCm 4.5. Cannot reproduce it anymore. Was never understood. Hopefully it was a bug in previous ROCm that was fixed by chance in 4.5. Closed now.
  
  Random crash processing Pb-Pb data: still there, but happens similarly as in ROCm 4.3, thus no regression in 4.5. Need to debug further what exactly happens and then report to AMD.
  
  Error with memory registration: fixed.
  
  GPU Performance issues in FST
  
  One infrequent performance issue remains: single iterations on AMD GPUs can take significantly longer (we have seen up to 26 seconds instead of 18 seconds). Under investigation, but no large global effect.
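
  To track how often such slow iterations occur, the per-iteration times from an FST run could be scanned as in the sketch below. This is a hypothetical helper in Python: the threshold of 1.3 times the median is an arbitrary assumption chosen so that 26 seconds against a typical 18 seconds is flagged, not a value taken from the FST.

```python
from statistics import median


def slow_iterations(times_s, factor=1.3):
    """Return (index, time) pairs for iterations slower than factor * median."""
    if not times_s:
        return []
    reference = median(times_s)
    return [(i, t) for i, t in enumerate(times_s) if t > factor * reference]
```

  Using the median as the reference keeps a few slow outliers from shifting the baseline.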
  
  Minor open points:
  
  https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but has side effects which must also be fixed.
  
  https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
  
  https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
  
  https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
  
  https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
  
  https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
  
  DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing way raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
  
  Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
  
  Workflow repository:
  
  Last reported bug in AliECS core for PDP expert panel fixed. Another GUI update addressing some comments from run coordination, with some simplifications, and finally with default O2DPG version taken from consul merged. Will be deployed at P2 next Monday.
  
  Restarting oncalls:
  
  Since MW1 in mid-January, daily meetings have restarted, and we give support during working hours.
  
  Full oncalls will restart on 21st of February - we should get the booking started.
  
  RC requests from us regular O2 software updates on the EPNs on Monday in parallel to the FLPSuite update. For now Ole offered to take care of it (some technical issues make it difficult to make this an oncall duty already now, should become one later).
  
  Oncall documentation updated.
  
  Doodle open for another training session next week.
  
  Detector status:
  
  EMCAL errors (no crash, just messages, EMCAL is working on it).
- 11:20 → 11:40
  Software for Hardware Accelerators 20m
  
  Speakers: David Rohr (CERN), Ole Schmidt (CERN)
  Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
  
  GPU Event display:
  
  Still working on Vulkan backend, no update this week.
  
  ITS GPU Tracking and Vertexing:
  
  Matteo will now continue with ITS GPU tracking.
  
  TPC GPU Tracking
  
  Implemented feature to update the calibration objects on the fly in a run, i.e. for the next time frame. They are then automatically updated on the GPU as well. To be tested this week with 2 different SCD transformation maps.
  
  TRD Tracking
  
  TRD GPU tracking fixed, now fully working both with GPUTPCGM and with o2::Propagator track models (problem was different normalization of GPU fast polynomial b field).
  
  TPC dEdx correction
  
  Performance tested on MI50 GPUs, only minor performance impact, merged in O2.