Color code: (news from this week: blue, news from last week: purple, no news: black)
Full system test in Milestone Week
- Created a new CI/build container for SLC8 with GPU support. New RPMs should now have proper GPU support on the EPNs. First tests failed due to differences in the settings of the Jenkins builders; still under test.
- Still failing to run a full system test that includes data distribution starting from readout.exe on the FLPs. The problem is at the level of assembling the time frames from all inputs. Current issues:
- TPC FLPs are configured to mask the feeid for LinkZS; this must be disabled for the final ZS in MC files.
- Seeing missing/corrupt FIT data in the test with all detectors, while it is OK in a test with fewer detectors; investigating.
- TODOs for correct startup of the workflows (should work in theory, but still to be tested; see the sketch after this list):
- Use --shm-segment-id 2 for StfBuilder, such that it cannot pin the main segments to the wrong NUMA domain.
- Use --child-driver option to pin the 2 NUMA workflows to the right domain.
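For illustration, a minimal sketch of what the startup could look like (assumed invocations, untested; the workflow script name and the numactl arguments are my own placeholders, only the two options above come from the actual TODOs):

  StfBuilder --shm-segment-id 2 [...]
  # one DPL workflow per NUMA domain, pinned via the child driver:
  o2-dpl-workflow.sh [...] --child-driver 'numactl --membind 0 --cpunodebind 0'
  o2-dpl-workflow.sh [...] --child-driver 'numactl --membind 1 --cpunodebind 1'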
AMD / GPU stability issues:
- Random server reboots: the problem signature is slightly different with the new patch; the server does not reboot automatically anymore but gets stuck, requiring a manual reboot. Failures occur at very irregular intervals: the problem can appear very soon after startup, but sometimes there is no issue for > 1 day.
- Random application crash with ~8 hours MTBF: cannot reproduce it anymore, possibly fixed in the new ROCm version. Closing for now.
- Still observing infrequent crashes when reconstructing the cosmic data with the GPU; still investigating where they come from.
- Reported a new issue: the compiler fails with an internal error when optimization is disabled (-O0).
- FST startup failure during GPU memory registration: the FST sometimes fails to start, but works correctly when repeated. This also happened immediately after a server reboot, which rules out that the server was in a bad state beforehand.
GPU performance issues:
- One infrequent performance issue remains: single iterations on AMD GPUs can take significantly longer; we have seen up to 26 seconds instead of 18 seconds. Under investigation, but no large global effect.
- Performance of the 32 GB GPU with 128 orbit TFs is lower than for the 70 orbit TFs we tested in August. Results fluctuate a bit, but the average is between 1600 and 1650 GPUs (compared to 1475 GPUs for 70 orbit TFs). Matteo will implement low-level benchmarks to understand the memory behavior after he has finalized some important ITS work (see the sketch below).
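To illustrate the idea, a minimal sketch of what such a low-level memory benchmark could look like (my own HIP example; the kernel, buffer sizes, and strides are illustrative placeholders, not Matteo's planned benchmarks):

  #include <hip/hip_runtime.h>
  #include <cstdio>

  // Each thread strides through the input buffer and accumulates, so the
  // reads cannot be optimized away; varying the stride exposes cache and
  // memory-controller behavior.
  __global__ void stridedRead(const float* in, float* out, size_t n, size_t stride)
  {
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t step = (size_t)gridDim.x * blockDim.x * stride;
    float acc = 0.f;
    for (size_t i = tid * stride; i < n; i += step) {
      acc += in[i];
    }
    out[tid] = acc;
  }

  int main()
  {
    const size_t n = 1ull << 28; // 2^28 floats = 1 GiB, illustrative
    float *in, *out;
    hipMalloc(&in, n * sizeof(float));
    hipMalloc(&out, 1024 * 256 * sizeof(float)); // one slot per thread
    hipMemset(in, 0, n * sizeof(float));
    hipEvent_t beg, end;
    hipEventCreate(&beg);
    hipEventCreate(&end);
    for (size_t stride : {1, 2, 4, 8, 16, 32}) {
      hipEventRecord(beg, 0);
      hipLaunchKernelGGL(stridedRead, dim3(1024), dim3(256), 0, 0, in, out, n, stride);
      hipEventRecord(end, 0);
      hipEventSynchronize(end);
      float ms = 0.f;
      hipEventElapsedTime(&ms, beg, end);
      double gbRead = (double)(n / stride) * sizeof(float) / 1e9;
      printf("stride %2zu: %7.2f ms, %6.1f GB/s effective\n", stride, ms, gbRead / (ms * 1e-3));
    }
    hipFree(in);
    hipFree(out);
    return 0;
  }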
Status of blocking issues for performing FST:
- A workaround is in place for the scheduling problem with 2 NUMA domains and the DPL pipeline. It cannot be merged into git since it breaks other things. Waiting for an improved workaround (this week) and a proper fix (later).
Issues on EPN farm affecting FST:
- Network problem between containers --> connection aborts, failures to check out git, etc. Can be circumvented by using IPoIB instead of the Ethernet connection.
Still missing features (from the list we discussed from April to August):
- Avoid reparsing of the TPC input buffers in the TPC reco workflow. Will be done by Matthias once he is back working on O2.
- Usage of the multi-threaded pipeline in DPL. It was still crashing with the latest version. Currently it cannot be tested, since the multi-threaded DPL branch is outdated and has merge conflicts: https://alice.its.cern.ch/jira/browse/O2-1967
Open minor DPL-related (or FairMQ) issues:
Issues with detectors:
- HMPID: possible bug in raw writer, segfaults, but only on EPNs. Under investigation.
- EMCAL errors (no crash, just messages, EMCAL is working on it).
Status of remaining detectors:
- TRD raw writer done, raw reader in PR status, reported some issues and waiting for a fix.
- HMPID: raw files created, but a bug in the raw-to-digits decoder prevents running CTF creation yet.
- FV0: raw files created, raw decoder still missing.
- MCH: Still problems in raw encoder, work in progress (https://alice.its.cern.ch/jira/browse/MRRTF-117)
Issues currently lacking manpower, waiting for a volunteer:
- Tool to "own" the SHM segment and keep it allocated and registered for the GPU. Tested that this works by running a dummy workflow in the same SHM segment in parallel. Need to implement a proper tool, add the GPU registering, add the FMQ feature to reset the segment without recreating it. Proposed to Matthias as a project. Remarks: Might be necessary to clear linux cache before allocation. What do we do with DD-owned unmanaged SHM region?
- For debugging, it would be convenient to have a proper tool that (using the FairMQ debug mode) can list all messages currently in the SHM segments, similar to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
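To make the segment-owner proposal concrete, a minimal sketch of the keep-alive idea (my own illustration using plain POSIX shared memory plus hipHostRegister; the real tool would have to operate on the FairMQ shmem segment and use the FMQ reset feature once it exists, and the segment name and size are placeholders):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <cstdio>
  #include <cstring>
  #include <hip/hip_runtime.h>

  int main()
  {
    const char* name = "/o2_shm_owner_demo"; // placeholder segment name
    const size_t size = 1ull << 30;          // 1 GiB, illustrative
    int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (fd < 0 || ftruncate(fd, size) != 0) {
      perror("shm");
      return 1;
    }
    void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
      perror("mmap");
      return 1;
    }
    memset(ptr, 0, size); // touch all pages so they are actually allocated
    mlock(ptr, size);     // keep them resident
    // Register the segment with the GPU runtime so it stays pinned for DMA.
    if (hipHostRegister(ptr, size, hipHostRegisterDefault) != hipSuccess) {
      fprintf(stderr, "hipHostRegister failed\n");
      return 1;
    }
    printf("owning %zu bytes at %p, Ctrl-C to release\n", size, ptr);
    pause(); // keep the allocation and GPU registration alive
    return 0;
  }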