Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC

Name: Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
Start: 2023-03-15T11:00:00+01:00
End: 2023-03-15T12:20:00+01:00
Location: No location set

Wednesday 15 Mar 2023, 11:00 → 12:20 Europe/Zurich

61230224927

David Rohr

Join via phone

- 11:00 → 11:20
  Discussion 20m
  
  Speakers: David Rohr (CERN), Ole Schmidt (CERN)
  Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
  
  High priority framework topics:
  Problem at end of run with lots of error messages and breaks many calibration runs https://alice.its.cern.ch/jira/browse/O2-3315
  Understood the problem with the reproducer in the FST which Ole provided end of last year. Fully fixed by https://github.com/AliceO2Group/AliceO2/pull/10940.
  Unfortunately, online there are still errors. Tried in staging yesterday (e.g. partition 2e5XVwXYwvX), and there are still different types of errors.
  Since EndOfStream is inherently unreliable for final calib uploads, had a discussion yesterday how to do this in a more controlled way in the future.
  Created some documentation here: https://alice.its.cern.ch/jira/browse/O2-3628
  Giulio provides a new check whether STOP is requested: https://github.com/AliceO2Group/AliceO2/pull/10948
  Ruben to create example task to follow the new scheme, then all detectors should adapt.
  Fix START-STOP-START for good
  https://github.com/AliceO2Group/AliceO2/pull/9895 still not merged due to a conflict.
  Checking the CALIB workflows at EoR, several issues were fixed along the line:
  Some states not reset, fixed by https://github.com/AliceO2Group/AliceO2/pull/10940
  Counters for oldestPossibleTimeslice and timesliceId of source are not in sync, and so far it worked more or less by chance. There are many counters, and to avoid resetting all of them in parallel, https://github.com/AliceO2Group/AliceO2/pull/10942 unifies them to one source for the timesliceId. Still under review since it has the chance to break something, but I think there is no alternative.
  Async workflow for 1NUMA domain with higher multiplicities gets stuck: https://alice.its.cern.ch/jira/browse/O2-3399
  Problem with async workflow getting stuck fully fixed.
  This problem fixed also the problem with the workflow sometimes being slow due to late arrival of the time frame throttling feedback.
  Problem with workflow sometimes slow due to "oscilations" in the processing understood:
  The workflow is fastest and in a stable state if the rate at all stages is identical, irrespective of the number of pipelines of a stage.
  Since we choose the pipelines slightly larger than needed, to account for variations in the rate, and for time frame fluctuations, more pipelines can act in one stage at a time than optimal, increasing the rate at that stage. This can lead to 2 problems:
  The CPU cores can be oversubscribed, making all processes slower, which slows down stages with single (or few) pipeline instances, which might be on the critical path at that point in time.
  Stages with many pipeline instances can empty their input queue too fast, e.g. if too many timeframes are published at once, and then they have nothing to process. While the other stages do not run enough processes to use all CPU cores.
  Not so easy to fix by adapting the pipeline sizes, most promising solution is to implement a heuristic to throttle the publishing to an average rate, instead of publishing bunches of many time frames irregularly, whenever the throttling feedback arrives from the end of the chain.
  Note also that this effect is most prominent at the beginning of async reco, and then it averages out. But this can take 2-3 hours.
  This problem does not appear in sync reco, since sync reco has a fixed input rate.
  Multi-threaded pipeline still not working in FST / sync processing, but only in standalone benchmark.
  Suppoort marking QC tasks as non-critical in DDS and O2Control topology export: https://alice.its.cern.ch/jira/browse/O2-3398
  Other framework tickets:
  After 64k vertex problem fixed, now problem in DebugGUI that it gets stuck at 100% CPU for large workflows https://alice.its.cern.ch/jira/browse/O2-3535
  Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
  Backpressure reporting when there is only 1 input channel: no progress.
  Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
  https://alice.its.cern.ch/jira/browse/O2-1900 : FIX in PR, but has side effects which must also be fixed.
  https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
  https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
  https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
  https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
  https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
  DPL Raw Sequencer segfaults when an HBF is missing. Fixed by changing way raw-reader sends the parts, but Matthias will add a check to prevent such failures in the future.
  Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
  Support in DPL GUI to send individual START and STOP commands.
  Problem I mentioned last time with non-critical QC tasks and DPL CCDB fetcher is real. Will need some extra work to solve it. Otherwise non-critical QC tasks will stall the DPL chain when they fail.
  Global calibration topics:
  TPC calib problem: Need to repeat the test with latest fixed, but as we still have problems in all online runs, I think it will require more work..
  Async reconstruction
  Fixed the problems with running local apptainer installations, seems now fully working.
  Dirk has set up an isolated container network config with full routing (i.e. same network as we run now, but isolated). Will test this on one node as proof of concept that our setup is working and can be used to provide custom network configs to the GRID container.
  No clear decision how to proceed.
  GRID group would prefer a setup with only 1 public IP per node, current setup, that was decided in the first meeting 6 weeks ago, requires 1 public IP per container, i.e. 2 per nodes.
  That is sufficient as long as we run 1 container per NUMA domain, and we could also get the 700 IPs per NUMA domain.
  In case for some reason we might want to run more containers, this would become complicated since it would require more IPs.
  Natural solution to have one IP per node would be a full cluster container orchestration system like Kubernetes, but probably overkill for running async jobs on the EPN farm.
  We also need to investigate what output connectivity the job containers actually need, to make sure all routes are present.
  Had 2 CCDB problems, one with main CCDB server, other with EOS instance, that interrupted async reco. Both fixed.
  Async reco with 1NUMA domain setup:
  Processing fully fixed, and tested in few runs on the GRID (EPN site).
  Will switch to the new setup with the next production, since it requires a new O2 version with many changes --> difficult to cherry pick.
  Observed a memory problem in async reco, could be memory leaks in multiple places. Most prominently:
  dpl dummy sink memory increases linearly over time, looks pretty much like a leak.
  its tracking and gpu reco have jumps where memory consumption increases by O(1 GB), and doesn't go down again. Not clear how this happens. For GPU reco it should be impossible, since all buffers are static. Could also come from a library or GPU runtime.
  aod-producer memory is also increasing over time, but not really linearly. Should also be checked.
  At the moment, we can either run with reduced number of CTF files per job, or with less time frames in flight, to avoid that we run out of memory during the lifetime of the job.
  Next step is to fully tune the 1NUMA workflow with latest O2 (performances of processes have changed), deploy for production, and do thorough comparison of 1NUMA, 1 GPU, and CPU only with EPN nodes under full load.
  Problem with CCDB pulling in libUV, breaking ROOT since headers not available. Broke async reco with latest O2 tag. Fixed by Giulio: https://github.com/AliceO2Group/AliceO2/pull/10954
  EPN major topics:
  EPN OS discussion
  Still no reply from AMD / NVIDIA for Alma support.
  Jenkins build for ALMA Linux / ROCm 5.4.2 working now.
  EPN staging nodes moved to ALMA 8 / ROCm 5.4.2. Production nodes will be moved with software update on Monday.
  Also async reco software will need to be recompiled (i.e. recompile same tag with new builder).
  Fast movement of nodes between async / online without EPN expert intervention.
  Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
  Check total bytes written on the SSDs, to get an impression how much of the lifetime of the SSDs we have already used, switch to using a ramdisk if necessary: https://alice.its.cern.ch/jira/browse/EPN-198 https://alice.its.cern.ch/jira/browse/EPN-237
  Improve DataDistribution file replay performance, currently cannot do faster than 0.8 Hz, cannot test MI100 EPN in Pb-Pb at nominal rate, and cannot test pp workflow for 100 EPNs in FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244
  Need DDS/ODC feature to deploy different topologies on EPNs with MI50 and with MI100.
  Go to error state if a critical task (e.g. calib task) fails (taking nmin into accound). But currently we do not see failing calibration tasks at all, except for a message in infologger. ODC should go to error, and then ECS should stop the run automatically. Also when n < nmin.
  Other EPN topics:
  Deploy DDS with support for non-critical tasks (eg QC): https://alice.its.cern.ch/jira/browse/EPN-131
  Check NUMA balancing after SHM allocation, sometimes nodes are unbalanced and slow: https://alice.its.cern.ch/jira/browse/EPN-245
  Fix problem with SetProperties string > 1024/1536 bytes: https://alice.its.cern.ch/jira/browse/EPN-134 and https://github.com/FairRootGroup/DDS/issues/440
  After software installation, check whether it succeeded on all online nodes (https://alice.its.cern.ch/jira/browse/EPN-155) and consolidate software deployment scripts in general.
  Improve InfoLogger messages when environment creation fails due to too few EPNs / calib nodes available, ideally report a proper error directly in the ECS GUI: https://alice.its.cern.ch/jira/browse/EPN-65
  If DD connection of a node fails, the node should be taken out and count against nmin, otherwise it can make the false impression that the processing on the other nodes is too slow.
  EPN farm upgrade:
  Full system test on EPN prototype with 8 MI100 GPUs crashing randomly (MTBF 5min to 2h)
  GPU performance / CPU load / memory usage / GPU temperatures ok.
  GPU problem being under investigation
  Changed build system so that we can have the same build for MI50 and MI100. Verified that same O2 build is stable on MI50 and crashes on MI100.
  Happens also if FST uses only a single GPU (I.e. does not come from the interplay of multiple GPUs).
  Crash does not appear running the standalone benchmark on the same dataset, which looks like a compilation problem: Either a compiler bug, or a bug in the code leading to memory corruption manifesting differently.
  Full system test issues:
  Full system test back to stable, had to revert one more MCH PR that broke it (for unclear reasons, Laurent is investigating).
  FST with pp data crashes on the EPNs in digitization phase due to alignment problem of DCS CCDB object. Could be a ROOT bug. Ruben is investigating.
  Topology generation:
  Added additional logging of topology generation command, provided by new ODC version deployed on Monday. Logging now more or less final in the sense that new features will only be added upon request.
  ECS command switched, to call topology generation from updateable RPMs directly. Removed the /home/epn/pdp folder on EPN shared home folder.
  Ole is investigating to use set -u or set -e to catch more errors, but has some drawbacks. Current plan is to use -u, but not -e. To be merged when Ole is back from vacation in 2 weeks.
  Switch to 32 orbit TF:
  MCH adapted their code to read TF length from CCDB.
  Only TOF is left, who use HBFUtils for TFLength, which is not initialized correctly online (can be done with extra env variable).
  Provided new 32 orbit SYNTHETIC datasets for P2, currently Pb-Pb only since pp simulation fails.
  QC / Monitoring / InfoLogger updates:
  TPC has opened first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and plan is to extend this to all detectors, and to include also trending for raw data sizes.
  CCDB topics.
  Costin will follow up some proposed improvements to have a proper fix for the future. See https://alice.its.cern.ch/jira/browse/O2-3097?filter=-2
  https://github.com/AliceO2Group/AliceO2/pull/9992#event-7875018488
  AliECS related topics:
  Improve error message in AliECS GUI for EPN related failures. PDP error messages are sent via ODC in the Run reply, e.g. for topology generation failures, but ECS does not show them, but only shows generic "EPN Partition Initialize Failed" https://alice.its.cern.ch/jira/browse/OCTRL-734
  ODC / Topology generation error messages now shown in the ECS GUI. Though GUI is a bit ugly and text is a bit convoluted with other content. Vasco is aware and they will clean it up in the next releases.
  GPU ROCm / compiler topics:
  Found a regression in ROCm 5.4.3, crashing with both MI50 and MI100. Sticking to 5.4.2 for now. Need to create good reproducer and report to AMD.
  Clang 15 / Arrow 11 / Root 6.28 merged.
  Codechecker adapted to new clang
  OpenCL switched to 3.0 standard, and now using clang-internal SPIRV backend instead of LLVM IL to SPIRV converter.
  Found new HIP internal compiler error when compiling without optimization: -O0 make the compilation fail with unsupported LLVM intrinsic. Reported to AMD.
  Found a new miscompilation with -ffast-math enabled in looper folllowing, for now disabled -ffast-math.
  Must create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
  Found another compiler problem with template treatment found by Ruben. Have a workaround for now. Need to create a minimal reproducer and file a bug report.
  Debugging the calibration, debug output triggered another internal compiler error in HIP compiler. No problem for now since it happened only with temporary debug code. But should still report it to AMD to fix it.
  TPC GPU Processing
  Random GPU crashes under investigation.
  TPC Full distortion correction merged.
  Problem with tracking leaving physical volume of the TPC fixed.
  Fixed incorrect error assignment in tracking efficiency QA plots.
  TRD Tracking
  ITS GPU Tracking and Vertexing:
  - Time frame splitting on different streams now fully functional for vertexer: different memory size configurations produce same results (can run from 1GB).
  Performance change with allocated memory, tend to saturate at some point.
  - Removed synchronisation point we had (see attached figure) at the end of parallel processing (cpu threads on different streams) of one chunk per thread.
  Now the next chunk to be processed is scheduled inside the thread such that in the end N threads process 1/Nth of TF in total before synchronisation.
  Recovered tenth of ms on the total elapsed time.
  - Tracking adaptation to timeframe splitting resumed, expect first PR this week.
  
  TPC ML clustering
  First results with data generation and training on GSI cluster. First results not so good, loss function needs to be adapted and retried.
  ANS Encoding
  Michael aims to have the PR for intergration in O2 on March 27th.
  Issues currently lacking manpower, waiting for a volunteer:
  For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
  Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.

Choose timezone

Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC