Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Name: Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
Start: 2023-09-27T11:00:00+02:00
End: 2023-09-27T12:20:00+02:00
Location: No location set

Wednesday 27 Sept 2023, 11:00 → 12:20 Europe/Zurich

David Rohr

- 11:00 → 11:20
  Discussion 20m
  
  Speakers: David Rohr (CERN), Ole Schmidt (CERN)
  Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
  High priority framework topics:
  Problem at end of run with lots of error messages, which breaks many calibration runs: https://alice.its.cern.ch/jira/browse/O2-3315
  Incorrect "Dropping lifetime::timeframe" messages at EOR, to be investigated together with Giulio once the multi-threading is deployed.
  Fix START-STOP-START for good
  https://github.com/AliceO2Group/AliceO2/pull/9895 still not merged due to a conflict.
  GPU Standalone multi-threading:
  Development basically finished, works in FST, but several problems were seen online. It also required several changes to DPL, TPC code, and the GPU double-pipelined mode itself:
  Possibility to inject missing 0xDEADBEEF data at readout-proxy level, not on the fly, such that multiple processes can subscribe to raw data.
  Customize out-of-band channels with pipeline index, to support multiple GPUs.
  Customize TPC sector completion policy to enforce identical timeslice order of prepare device and processing device.
  Change double-pipeline such that output of previous TF is always finished before input for new TFs is created, and schedule the synchronizations better during TPC clusterization, to avoid hangs and to guarantee full GPU utilization.
  Fixed issues since tests at P2:
  input proxy was missing headers when injecting data
  input proxy was injecting data at EoS
  fixed a race condition in DPL InitTask while waiting for FMQ region events
  TPC cluster / raw decoding QC was duplicated for the 2 threads, leading to problems with ROOT. Fixed by using the same TPC GPU QA instance for both pipeline threads, since they don't need it at the same time.
  Fixed bug in TPC GPU QA dereferencing nullptr.
  Remaining issues:
  Run on staging crashes with TFs out of sync, apparently the completion policy is not working as it should.
  There could be more issues....
  Problem with QC topologies with expendable tasks. For items to do see: https://alice.its.cern.ch/jira/browse/QC-953 - Status?
  Problem in QC where we collect messages in memory while the run is stopped: https://alice.its.cern.ch/jira/browse/O2-3691
  Tests ok, will be deployed after HI and then we see.
  Switch 0xdeadbeef handling from on-the-fly creating dummy messages for optional messages, to injecting them at readout-proxy level.
  Will be steered by command line option, since it should not be active for calib workflows.
  In any case the right thing to do. Less failure-prone, and will solve several issues, and particularly allow multiple processes to subscribe to raw data.
  Needed for standalone multi-threading.
  Eventually, will require all EPN and FLP workflows that process raw data to add a command line option to the readout proxy.
  In the meantime, we enable it only for EPNs, remove the optional flag only for the TPC GPU reco, and leave the rest as is. Should be backward compatible, and implies minimum changes during HI.
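The idea of injecting placeholders once at the proxy, instead of creating them on the fly in each consumer, can be illustrated with a minimal sketch. It is not the O2 readout-proxy code: the function name, the dictionary layout, and the input names are hypothetical, and only the 0xDEADBEEF marker value is taken from the notes above.

```python
# Hypothetical sketch: names and data layout are illustrative, not the O2 API.
DEADBEEF = 0xDEADBEEF  # marker value used for placeholder (missing) raw data


def inject_missing_inputs(received, expected):
    """Return a complete TF in which every expected input is present.

    received: dict mapping an input name to its payload (bytes)
    expected: iterable of input names the downstream workflow waits for
    Each value of the result is a (marker, payload) tuple; marker is DEADBEEF
    for injected placeholders and None for real data.
    """
    complete = {name: (None, payload) for name, payload in received.items()}
    for name in expected:
        # Only fills the gap if the input did not arrive.
        complete.setdefault(name, (DEADBEEF, b""))
    return complete
```

Because the placeholder is already part of the TF when it leaves the proxy, every subscriber sees the same complete set of inputs, which is presumably what allows multiple processes to subscribe to the raw data.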
  Other framework tickets:
  TOF problem with receiving condition in tof-compressor: https://alice.its.cern.ch/jira/browse/O2-3681
  Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
  Backpressure reporting when there is only 1 input channel: no progress.
  Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
  https://alice.its.cern.ch/jira/browse/O2-1900 : FIX in PR, but has side effects which must also be fixed.
  https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
  https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
  https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
  https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
  https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
  Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
  Support in DPL GUI to send individual START and STOP commands.
  Problem I mentioned last time with non-critical QC tasks and DPL CCDB fetcher is real. Will need some extra work to solve it. Otherwise non-critical QC tasks will stall the DPL chain when they fail.
  Global calibration topics:
  TPC IDC workflow problem - no progress.
  TPC has issues with SAC workflow. Need to understand if this is the known long-standing DPL issue with "Dropping lifetime::timeframe" or something else.
  Even with latest changes, difficult to ensure guaranteed calibration finalization at end of global run (as discussed with Ruben yesterday).
  After discussion with Peter, Giulio this morning: We should push for 2 additional states in the state machine at the end of run between RUNNING and READY:
  DRAIN: For all but O2, the transition RUNNING --> FINALIZE is identical to what we do in STOP: RUNNING --> READY at the moment. I.e. no more data will come in then.
  O2 could finalize the current TF processing with some time out, where it stops processing incoming data, and at EndOfStream trigger the calibration postprocessing.
  FINALIZE: No more data is guaranteed to come in, but the calibration could still be running. So we leave FMQ channels open, and have a time out to finalize the calibration. If input proxies have not yet received the EndOfStream, they will inject it to trigger the final calibration.
  This would require changes in O2, DD, ECS, FMQ, but all changes except for in O2 should be trivial, since all other components would not do anything in these states.
  Started to draft a document, but want to double-check it will work out this way before making it public.
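A minimal sketch of the proposed end-of-run sequence, assuming the two new states are traversed in the order RUNNING, DRAIN, FINALIZE, READY. That ordering is a reading of the notes above, not a confirmed design, and the code is not related to the actual FairMQ or ECS state machine implementation.

```python
from enum import Enum


class State(Enum):
    READY = "READY"
    RUNNING = "RUNNING"
    DRAIN = "DRAIN"        # proposed: finish the TFs in flight, with a timeout
    FINALIZE = "FINALIZE"  # proposed: no more data, calibration may still run


# Assumed end-of-run order; today STOP goes directly RUNNING -> READY.
END_OF_RUN = {
    State.RUNNING: State.DRAIN,
    State.DRAIN: State.FINALIZE,
    State.FINALIZE: State.READY,
}


def stop_sequence(start=State.RUNNING):
    """Return the list of states visited from `start` until READY is reached."""
    path = [start]
    while path[-1] is not State.READY:
        if path[-1] not in END_OF_RUN:
            raise ValueError(f"no end-of-run transition from {path[-1].name}")
        path.append(END_OF_RUN[path[-1]])
    return path
```

Components other than O2 would simply pass through DRAIN and FINALIZE without doing anything, which is why the changes outside O2 are expected to be trivial.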
  Problem when calibration aggregators suddenly receive an endOfStream in the middle of the run and stop processing:
  Happens since ODC 0.78, which checks device states during running: If one device fails, it sends sigkill to all devices of the collection, and FairMQ takes the shortest way through the state machine to EXIT, which involves a STOP transition, which then sends the EoS. The DPL input proxy on the calib node should in principle check that it has received EoS from all nodes, but for some reason that is not working. Todo:
  Fix the input proxy to correctly count the number of EoS.
  Change the device behavior such that we do not send EoS in case of sigkill when running on FLP/EPN.
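The first todo item amounts to forwarding EoS only once every expected sender has delivered one. A minimal sketch of that bookkeeping follows; the class and the node names are hypothetical and do not reflect the actual DPL input proxy code.

```python
class EndOfStreamCounter:
    """Track which senders have delivered EoS (hypothetical helper)."""

    def __init__(self, expected_senders):
        self._expected = frozenset(expected_senders)
        self._seen = set()

    @property
    def complete(self):
        """True once every expected sender has sent EoS."""
        return self._seen == self._expected

    def on_end_of_stream(self, sender):
        """Record an EoS from `sender` and return whether all have arrived.

        A repeated EoS from the same sender is counted only once, so a single
        node cannot trigger the downstream EoS on its own.
        """
        if sender not in self._expected:
            raise ValueError(f"unexpected sender: {sender}")
        self._seen.add(sender)
        return self.complete
```

Tracking sender identities rather than a plain integer count is a design choice of this sketch: it keeps the result correct even if one node sends EoS twice.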
  CCDB:
  Performance issues accessing https://alice-ccdb.cern.ch seen by Matteo and me.
  Merged in alidist, PR to bump libwebsockets stuck by some MacOS / OpenSSL issues. Giulio is checking.
  JAlien-ROOT will switch to supporting libuv, so libwebsockets can be used by DPL and JAlien-ROOT with one libuv loop.
  This was reverted, since analysis reported a performance problem. Though I do not understand how the patch can cause a performance problem with the old libwebsockets, since there is a hard #ifdef LIBWEBSOCKETVERSION > ...
  Anyhow, they are checking the performance after the revert, and CCDB experts will check JAliEn-ROOT.
  Meanwhile, everyone who has the CCDB performance issues (the one we had, not the one in analysis), should locally bump to >= 0.7.3.
  Giulio's PR to bump libwebsockets 4.x, JAliEn-ROOT 0.7.4 and OpenSSL passes CI. To be tested by David if issues are solved. But should not be merged before analysis issues are understood.
  Problem with CCDB objects created at P2 not synced fast enough, so testing async reco directly did not work due to CCDB failures. Costin did some steps to improve the syncing.
  Sync reconstruction:
  First Pb-Pb stable beam yesterday. From the processing side and the stability of PDP (and also EPN nodes), very good. But several issues seen:
  EMCAL raw decoder crashing - EMCAL needs to check.
  ZDC QC too slow on FLPs, creating backpressure and dropping ZDC data. Fixed by downscaling QC ratio.
  ITS interlock triggered by cooling plant problem, required access.
  Sometimes fully empty ITS time frames, to be understood.
  Problem with MCH FLPs getting stuck - to be investigated.
  Problem with large background (even without collisions) making 2 ITS inner layer staves fully busy. For now excluded from readout. (1/12th of acceptance in inner layer)
  Problem with TRD, LME disabling links. Basically all stacks but stack 2 (eta = 0) disabled. We don't have TRD data for larger z.
  CTF Size:
  ~325 CTFs per CTF file of 10 GB (10 GB = 10^10), so ~30 MB per CTF.
  I checked some TPC data manually, and I see 6-8 collisions per TF, which corresponds very well to the 2.3 kHz hadronic you quote (2.3 kHz hadronic / TF rate of 352 = 6.5). And assuming my visual inspection missed very peripheral collisions, hadronic rate should have been >= 2.3 kHz.
  That gives on average 4.7 MB per collision.
  Naively scaling to 50 kHz gives 235 GB/s, but is probably not realistic starting from such low IR. We should double check with higher IR.
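The arithmetic behind the CTF size numbers can be reproduced with a short sketch. All inputs are the values quoted above; the function and variable names are just for illustration.

```python
# Back-of-the-envelope CTF size estimate, using only the numbers quoted above.
CTF_FILE_BYTES = 10**10      # 10 GB per CTF file
CTFS_PER_FILE = 325          # ~325 CTFs per file
HADRONIC_RATE_HZ = 2300.0    # 2.3 kHz hadronic interaction rate
TF_RATE_HZ = 352.0           # TF rate used in the estimate above
TARGET_RATE_HZ = 50_000.0    # 50 kHz, for the naive extrapolation


def ctf_size_estimate():
    """Return the derived quantities in MB, collisions per TF, and GB/s."""
    bytes_per_ctf = CTF_FILE_BYTES / CTFS_PER_FILE
    collisions_per_tf = HADRONIC_RATE_HZ / TF_RATE_HZ
    bytes_per_collision = bytes_per_ctf / collisions_per_tf
    bytes_per_second = bytes_per_collision * TARGET_RATE_HZ
    return {
        "mb_per_ctf": bytes_per_ctf / 1e6,
        "collisions_per_tf": collisions_per_tf,
        "mb_per_collision": bytes_per_collision / 1e6,
        "gb_per_s_at_target": bytes_per_second / 1e9,
    }


if __name__ == "__main__":
    for key, value in ctf_size_estimate().items():
        print(f"{key}: {value:.2f}")
```

The linear scaling assumes the size per collision stays constant with interaction rate, which is exactly the assumption flagged above as probably unrealistic.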
  Async reconstruction
  Remaining oscillation problem: GPUs sometimes get stalled for a long time, up to 2 minutes. No progress.
  Checking 2 things: does the situation get better without GPU monitoring? --> Inconclusive
  We can use increased GPU process priority as a mitigation, but it doesn't fully fix the issue.
  Performance issue seen in async reco on MI100, need to investigate.
  EPN major topics:
  Fast movement of nodes between async / online without EPN expert intervention.
  2 goals I would like to set for the final solution:
  It should not be needed to stop the SLURM schedulers when moving nodes, there should be no limitation for ongoing runs at P2 and ongoing async jobs.
  We must not lose which nodes are marked as bad while moving.
  Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
  Lubos to provide interface to query current EPN SHM settings - ETA July 2023, Status?
  Improve DataDistribution file replay performance, currently cannot do faster than 0.8 Hz, cannot test MI100 EPN in Pb-Pb at nominal rate, and cannot test pp workflow for 100 EPNs in FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
  Interface to communicate list of active EPNs to epn2eos monitoring: https://alice.its.cern.ch/jira/browse/EPN-381
  Calib nodes integrated by Federico, working
  DataDistribution distributes data round-robin in absence of backpressure, but it would be better to do it based on buffer utilization, and give more data to MI100 nodes. Now, we are driving the MI50 nodes at 100% capacity with backpressure, and then only backpressured TFs go on MI100 nodes. This increases the memory pressure on the MI50 nodes, which is anyway a critical point. https://alice.its.cern.ch/jira/browse/EPN-397
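The difference between the two policies in the last item can be sketched as follows. This is not DataDistribution code: the function names, the node names, and the idea of a per-node free buffer fraction as the input are assumptions for illustration.

```python
from itertools import cycle


def round_robin(nodes):
    """Endless iterator over nodes in fixed order (ignores node load)."""
    return cycle(list(nodes))


def pick_by_free_buffer(free_fraction):
    """Return the node name with the largest free buffer fraction.

    free_fraction: dict mapping node name -> free buffer fraction in [0, 1].
    Ties are broken alphabetically so the choice is deterministic.
    """
    if not free_fraction:
        raise ValueError("no nodes available")
    return max(sorted(free_fraction), key=free_fraction.get)
```

With such a policy, a faster node that drains its buffer more quickly would receive the next TF more often, without first having to saturate the slower nodes.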
  Other EPN topics:
  Check NUMA balancing after SHM allocation, sometimes nodes are unbalanced and slow: https://alice.its.cern.ch/jira/browse/EPN-245
  Fix problem with SetProperties string > 1024/1536 bytes: https://alice.its.cern.ch/jira/browse/EPN-134 and https://github.com/FairRootGroup/DDS/issues/440
  After software installation, check whether it succeeded on all online nodes (https://alice.its.cern.ch/jira/browse/EPN-155) and consolidate software deployment scripts in general.
  Improve InfoLogger messages when environment creation fails due to too few EPNs / calib nodes available, ideally report a proper error directly in the ECS GUI: https://alice.its.cern.ch/jira/browse/EPN-65
  Create user for epn2eos experts for debugging: https://alice.its.cern.ch/jira/browse/EPN-383
  EPNs sometimes get in a bad state, with CPU stuck, probably due to AMD driver. To be investigated and reported to AMD.
  TPC Raw decoding checks:
  Add additional check on DPL level, to make sure firstOrbit received from all detectors is identical, when creating the TimeFrame first orbit.
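A minimal sketch of such a consistency check, assuming the firstOrbit values are available per detector when the TimeFrame is built. The function name and the dictionary input are hypothetical and not part of DPL.

```python
def common_first_orbit(first_orbit_by_detector):
    """Return the firstOrbit shared by all detectors, or raise if they differ.

    first_orbit_by_detector: dict mapping detector name -> firstOrbit (int).
    """
    orbits = set(first_orbit_by_detector.values())
    if not orbits:
        raise ValueError("no detector input available")
    if len(orbits) != 1:
        raise ValueError(
            f"inconsistent firstOrbit values: {dict(first_orbit_by_detector)}"
        )
    return orbits.pop()
```

For example, `common_first_orbit({"TPC": 1024, "ITS": 1024})` returns the shared value, while differing values raise an error that lists the orbit reported by each detector.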
  Full system test issues:
  Topology generation:
  Should test to deploy topology with DPL driver, to have the remote GUI available. Status?
  Software deployment at P2.
  Deployed several software updates at P2 with cherry-picks:
  Again several minor fixes.
  Also have a version with TPC multi-threading, but still not stable. Though multi-threading can be disabled and at least that seems stable in staging. But not yet used in production.
  Next updates to be included:
  Fix TPC Standalone multi-threading
  DPL fix for incorrect dropping lifetime::timeframe.
  QC / Monitoring / InfoLogger updates:
  TPC has opened first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and plan is to extend this to all detectors, and to include also trending for raw data sizes.
  AliECS related topics:
  Extra env var field still not multi-line by default.
  GPU ROCm / compiler topics:
  Found new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with unsupported LLVM intrinsic. Reported to AMD.
  Found a new miscompilation with -ffast-math enabled in looper following, for now disabled -ffast-math.
  Must create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
  Found another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.
  Debugging the calibration, debug output triggered another internal compiler error in HIP compiler. No problem for now since it happened only with temporary debug code. But should still report it to AMD to fix it.
  New compiler regression in ROCm 5.6, need to create testcase and send to AMD.
  ROCm 5.7 released, didn't check yet. AMD MI50 will go end of maintenance in Q2 2024. Checking with AMD if the card will still be supported by future ROCm versions.
  TPC GPU Processing
  Bug in TPC QC with MC embedding, TPC QC does not respect sourceID of MC labels, so confuses tracks of signal and of background events.
  New TPC Cluster error parameterization merged.
  TPC Trigger extraction from raw data merged.
  Online runs at low IR / low energy observe weird number of clusters per track statistics.
  Problem was due to incorrect vdrift, though it is not clear why this breaks tracking so badly, being investigated.
  Ruben reported an issue with global track refit, which sometimes does not produce the TPC track fit results. To be investigated.
  ANS Encoding
  Tested new ANS coding in SYNTHETIC run at P2 with Pb-Pb MC. Actually slightly faster than before, and 14% better compression with on-the-fly dictionaries. (Though it should be noted that 5% better compression is already achieved in the compatibility mode, i.e. is from general improvements not dictionaries. And the dictionaries used for the reference were old and not optimal.) In any case, very good progress, and generally superior to old version.
  Issues currently lacking manpower, waiting for a volunteer:
  For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
  Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
- 11:20 → 11:25
  
  TRD Tracking 5m
  
  Speaker: Ole Schmidt (CERN)
- 11:25 → 11:30
  
  TPC ML Clustering 5m
  
  Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
- 11:30 → 11:35
  
  ITS Tracking 5m
  
  Speaker: Matteo Concas (CERN)