Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
11:00 → 11:20: PDP Run Coordination (20m)
Speakers: David Rohr (CERN), Ole Schmidt (CERN)
Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
Reducing overhead from headers:
- Matthias is working on this: https://alice.its.cern.ch/jira/browse/O2-2395.
- Regression fixed, PR merged.
- Gvozden opened a PR today to integrate this in DD. An FST for it will likely run later today.
Issue with start / stop / start cycle.
- DPL processes die if ECS does a cycle start / stop / start. Currently this means a new partition must be created for every run: https://alice.its.cern.ch/jira/browse/O2-2559
- Joined a debug session with Giulio, Matthias, Adam, David, and Federico.
- Issue reproduced and fixed, but there is another issue that after the second start the readout-proxy doesn't receive data, i.e. 2 runs in same partition still not working.
- Giulio has added restart functionality to the DebugGUI, so this can be reproduced and investigated locally.
Proper EOS handling in workflows (current plan):
- Summarized what we want to do in this JIRA: https://alice.its.cern.ch/jira/browse/O2-2715?filter=-2
- Giulio will work on it once the more important issues are solved.
GPU ROCm Issues to be followed up by EPN:
- EPN team will set up a test script for future validation of ROCm releases.
- Check if ROCm 4.5 fixed the server crashes we had when enabling GPU monitoring
- Check if ROCm 4.5 fixes the issue that GPU clock speeds are sometimes set too low.
- Create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
Problem with EPN OS and GPUs:
- AMD GPUs currently not working on CS8; investigating. For the moment the EPNs must stay on CC8.
- ROCm 4.5 fails to register memory when we switch to CentOS 8.5 (8.5 is the final CC8 version, since CC8 is now EOL. If we want to stick to CC8 on the EPNs for some time, it would perhaps make sense to install this latest version now).
- Inquired with AMD about future support of RHEL clones, waiting for reply.
New defaults for EPN:
- Tonight we had the first build with EPN defaults for the EPNs. It failed in O2Physics, so there are no RPMs available that we can test. However, building O2 passed, i.e. the concerns we had about AVX code breaking the compile-time execution of some O2 binaries on the build nodes might not be fatal, but we need more statistics to be sure.
Full system test issues:
- Merged this morning the last PR by MFT to write raw data files in the correct way. Will update the JIRA ticket today and create a full FST raw-file data set for Sylvain to test.
Issues currently lacking manpower, waiting for a volunteer:
- For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
- Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
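As a starting point for such a tool, a minimal sketch below lists shared-memory segments by size. Assumptions: FairMQ's managed segments appear as regular files under /dev/shm with an "fmq_" name prefix; a real tool would additionally parse the FairMQ debug-mode metadata to list the individual messages inside each segment.

```python
import os

SHM_DIR = "/dev/shm"

def list_shm_segments(prefix="fmq_"):
    """List POSIX shared-memory segments with their sizes.

    The 'fmq_' prefix is an assumption about FairMQ's naming
    convention; adjust it for the segments actually in use.
    """
    if not os.path.isdir(SHM_DIR):  # e.g. non-Linux host
        return []
    segments = []
    for name in sorted(os.listdir(SHM_DIR)):
        path = os.path.join(SHM_DIR, name)
        if name.startswith(prefix) and os.path.isfile(path):
            segments.append((name, os.path.getsize(path)))
    return segments

if __name__ == "__main__":
    for name, size in list_shm_segments():
        print(f"{name}: {size / (1 << 20):.1f} MiB")
```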
InfoLogger messages:
- Finalized cleanup campaign.
- Went through all errors / warnings during pilot beam, and adjusted severities, such that detector issues show as alarm.
- Minimum severity raised to "important", i.e. "warnings" are no longer shown in InfoLogger, only in local log files on the nodes.
- Collection of important alarm/error messages will be added to shifter/oncall instruction by Ole, together with some hint what is the background / whom to call.
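To illustrate the raised minimum-severity policy, a minimal sketch of the thresholding logic follows. The exact set and ordering of severity names here is an assumption loosely following the InfoLogger scheme; only the filtering idea is the point.

```python
# Severity ranks, lowest to highest (names/ordering are an assumption).
SEVERITY_RANK = {"debug": 0, "info": 1, "warning": 2, "important": 3,
                 "error": 4, "fatal": 5}

def passes_threshold(severity, minimum="important"):
    """Return True if a message should be shown in InfoLogger, i.e. its
    severity is at or above the configured minimum ('warning' and below
    then remain only in the local log files on the nodes)."""
    return SEVERITY_RANK[severity] >= SEVERITY_RANK[minimum]
```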
Time frame throttling:
- Not working together with QC (thus unusable for sync reco).
- Throttling mechanism already extracted to dedicated class, to be used in different places.
- In order to use it for other cases but the raw-proxy, the ad-hoc setup of the out-of-band FMQ channel must be replaced by a generic out-of-band channel DPL feature.
Speeding up startup of workflow on the EPN:
- SHM management tool fully integrated by Alexey, DD support added by Gvozden, NUMA awareness added by David.
- Conducted PDP processing workflow startup time standalone measurements: 18 seconds for full Pb-Pb reconstruction workflow in 8 GPU setup on an EPN node.
- Reconstruction workflow, i.e. it contains everything except QC and calibration (though adding QC and calibration should increase the time only marginally, if at all).
- Tested with dev branches of O2 and DD with some custom hacks, i.e. currently not reproducible online (particularly, manually made the features requested in R3C-646 and R3C-696 work).
- One problem in DPL (cannot parse empty string arguments) currently prevents merging the changes in O2. https://alice.its.cern.ch/jira/browse/O2-2757
- Measurement assumed the same workflow was run before at least once, so that the JSONs / XMLs are cached.
- This includes the whole startup through all states, i.e. from processes not running --> INITIALIZED --> READY --> RUNNING (i.e. both the Configure and the Start in AliECS).
- Measured on a single node, but all EPNs start in parallel.
- From the PDP side, this is more or less as good as it gets. Perhaps we can cut 2 or 3 seconds more, but we would not expect significant further improvements.
- Sorted out some additional issues / hangs with usage of the SHM manager tool. It should now also support the case of crashed processes.
- Tested yesterday in several iterations running many 8 GPU FSTs in the same segment and killing them with -9. Seems stable, but obviously that does not cover all cases.
- AliECS GUI changes to configure the segment ID for the workflow (in case the tool is used) are merged and should be available with the next FLPSuite next Monday. (Disabled by default, but can be enabled in the expert panel; then we can test it in a real partition with the tool from the PDP side.)
- Next steps:
- Get R3C-646 and R3C-696 closed.
- Deploy the tool on the EPN.
- Remeasure in a real partition controlled by AliECS, compare the time, and see where we have delays and lose time because things don't run in parallel.
New feature: we must not create the DPL workflow JSON in each individual process (currently this means we start O(1000) processes at each start of run just to create the JSONs). Discussing with the DDS team here: https://alice.its.cern.ch/jira/browse/R3C-696. Anar is implementing a config-file feature in DDS: https://github.com/FairRootGroup/DDS/issues/406
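The idea of caching the generated topology so it is not recreated at every start of run could be sketched roughly as follows. The cache location and key scheme here are hypothetical; the real mechanism under discussion is the DDS config-file feature linked above.

```python
import hashlib
import json
import os
import tempfile

# Hypothetical cache location; on the EPNs this would live on shared storage.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "dpl-topology-cache")

def cached_topology(workflow_cmd, software_version):
    """Return the topology JSON for a workflow, generating it only on a
    cache miss, so repeated starts of the same workflow reuse the file
    instead of spawning many processes just to recreate it."""
    key = hashlib.sha256(f"{software_version}:{workflow_cmd}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if not os.path.exists(path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        # A real implementation would run the workflow binary in a dry-run
        # mode to dump its topology; here we only record the inputs.
        with open(path, "w") as f:
            json.dump({"cmd": workflow_cmd, "version": software_version}, f)
    with open(path) as f:
        return json.load(f)
```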
EPN Scheduler:
- EPN access now restricted to the login node only; all further access goes via Slurm (except for some selected people who need e.g. to attach a debugger to sync processing).
- Set up a group to administrate who can submit to the prod partition (currently Matteo for GPU tests).
- Fixed issue with GPU access permissions (thx to Matteo for reporting).
- Next step: Add VObox for GRID
Important framework features to follow up:
- Multi-threaded GPU pipeline in DPL: https://alice.its.cern.ch/jira/browse/O2-1967 : Giulio will follow this up, but currently some other things still have higher priority.
- Bug: ROOT writer storing TFs to out of sync entries in the ROOT file if DPL pipelines are used.
- Fix via completion policy in https://github.com/AliceO2Group/AliceO2/pull/7536, but needs additional framework support.
- Suppress default options when generating DDS command lines: https://alice.its.cern.ch/jira/browse/O2-2736 - will drop this, not needed once we cache the DPL JSON topology.
- Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
- Backpressure reporting when there is only 1 input channel: no progress.
- (Not for this week) multi-threaded pipeline: no progress.
- (Not for this week) Problem with forwarding of multi-spec output: mid-entropy encoder receives the TF twice: no progress.
- Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
- Output proxy spinning at 100%: no progress.
AMD / GPU stability issues in FST:
- ROCm 4.5 released. Result of tests of open issues:
- Random server crashes with error reported in amdgpu kernel module in dmesg: Not fully fixed, had at least one crash now, but with different dmesg message compared to before. Not clear if same issue or something different (or a hardware error, happened only once, but have only one test node so far).
- Random crash with noisy pp data: Disappeared with ROCm 4.5. Cannot reproduce it anymore. Was never understood. Hopefully it was a bug in previous ROCm that was fixed by chance in 4.5. Closed now.
- Random crash processing Pb-Pb data: still there, but happens similarly as in ROCm 4.3, thus no regression in 4.5. Need to debug further what exactly happens and then report to AMD.
- Error with memory registration: fixed.
GPU Performance issues in FST
- One infrequent performance issue remains: single iterations on AMD GPUs can take significantly longer, up to 26 seconds instead of 18 seconds. Under investigation, but no large global effect.
Minor open points:
- https://alice.its.cern.ch/jira/browse/O2-1900 : FIX in PR, but has side effects which must also be fixed.
- https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
- https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
- https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
- https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
- https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing the way the raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
- Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
Workflow repository:
- Next FLPSuite update will provide more configuration options for PDP expert panel:
- O2DPG (workflow), O2PDPSuite (software), and QC JSON (QC config) versions can be set individually. By default each will be the string `default`, in which case the default versions are fetched from consul, where they can be configured by RC.
- With this, we can deploy new O2 versions on the EPNs in parallel, test them, and upgrade / downgrade can be handled by RC by just changing the setting in consul.
- QC JSONs are still not versioned in consul, i.e. whenever we get a new version from AliECS, we fetch the JSONs and store them in the workflow cache under the version provided by AliECS. This means that when detectors update the JSON files in consul, they need to bump this version manually for now (until FLP implements the versioning in consul).
- Currently, JSONs are still not fetched from consul due to the inconsistent naming scheme used, which is being cleaned up by Barth.
- Supports setting shmid for shm management tool (see above).
- Can set the fraction of raw data to be stored; this will be forwarded to DD (not yet implemented in DD).
- Some cleanup.
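The `default` version resolution described above could look roughly like this. The consul lookup is mocked with a dict and the key names and version strings are hypothetical; only the resolution rule is taken from the notes.

```python
# Mocked consul defaults; in reality these are keys that RC can edit.
CONSUL_DEFAULTS = {
    "O2DPG": "v1.2.0",          # hypothetical version strings
    "O2PDPSuite": "nightly-1",
    "QC": "qc-v3",
}

def resolve_version(component, requested="default"):
    """The literal string 'default' means 'take the RC-configured default
    from consul'; any other value is used verbatim, which allows deploying
    and testing several O2 versions on the EPNs in parallel."""
    return CONSUL_DEFAULTS[component] if requested == "default" else requested
```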
Detector status:
- EMCAL errors (no crash, just messages, EMCAL is working on it).
11:20 → 11:40: Software for Hardware Accelerators (20m)
Speakers: David Rohr (CERN), Ole Schmidt (CERN)
Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
GPU Event display:
- Merged the feature to support different backends (started a toy project over Christmas to support Vulkan, and hoping to add that soon).
ITS GPU Tracking and Vertexing:
- Matteo will now continue with ITS GPU tracking.
TRD Tracking
- Restored TRD GPU Tracking on Run 2 data (required some minor fixes after recent changes in TRD tracking).
- Made TRD GPU tracking work in O2 for Run 3 data; so far it only works properly with the GPU track model (GPUTPCGMTrackParam).
- It runs through with the O2 track model (TrackParCov), but attaches fewer tracklets. Ole is investigating.
TPC dEdx correction
- New multi-variant class implemented with GPU support; Jens will test its performance on AMD GPUs on the EPN.