Name: Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
Start: 2023-11-29T11:00:00+01:00
End: 2023-11-29T12:20:00+01:00
Location: No location set

1

Discussion

Speakers: David Rohr (CERN), Ole Schmidt (CERN)

Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)

High priority framework topics:

Problem with EndOfStream / Dropping Lifetime::Timeframe data
- 3rd debug feature in development, ETA end of next week.
Related: Should we change the consumeWhenAll completion policy to wait for the oldestPossibleTimeframe update on all channels, so that sporadic data cannot be missed?
- We can add a "consumeWhenAllTimeframe" policy with the old behavior if needed, but I think in most cases the change will not decrease the performance. It can only delay the processing if there are multiple sporadic inputs and, while we wait for all TF inputs, not all sporadic inputs arrive, which is a very special case.
- But in general, I think this is an important safety measure that can prevent us from missing some sporadic data. Note that missing such data does not trigger any error. And there are cases where the presence of other workflows changes the routes, and sporadic data for multiple inputs can arrive together or as 2 messages... So if we have such an error at some point, it would be very difficult to understand.
- Better to exclude it from the beginning.
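
Illustrative sketch of the two policies discussed above, as a toy model in plain Python. This is not DPL code: the Slot type, the function names, and the channel and input names are hypothetical, and the rule "consume immediately if all sporadic inputs are already present" is one reading of the notes above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Slot:
    """Toy model of the inputs collected for one timeframe (not a DPL type)."""
    tf_id: int
    timeframe_inputs: Dict[str, bool] = field(default_factory=dict)  # input -> arrived
    sporadic_inputs: Dict[str, bool] = field(default_factory=dict)   # input -> arrived


def ready_old(slot: Slot) -> bool:
    """Old behavior: consume as soon as all Lifetime::Timeframe inputs are present."""
    return all(slot.timeframe_inputs.values())


def ready_new(slot: Slot, oldest_possible_tf: Dict[str, int]) -> bool:
    """Proposed behavior (as modeled here): additionally wait until every channel
    has announced an oldestPossibleTimeframe beyond this slot, unless all
    sporadic inputs are already present."""
    if not all(slot.timeframe_inputs.values()):
        return False
    if all(slot.sporadic_inputs.values()):
        return True
    return all(oldest > slot.tf_id for oldest in oldest_possible_tf.values())


slot = Slot(tf_id=5,
            timeframe_inputs={"clusters": True, "tracks": True},
            sporadic_inputs={"calibA": True, "calibB": False})

# calibB could still arrive on channel "c2" for TF 5, so the new policy keeps waiting,
# while the old policy would already consume the slot without calibB.
print(ready_old(slot), ready_new(slot, {"c1": 6, "c2": 5}))
# Once all channels have moved past TF 5, nothing more can arrive for it.
print(ready_new(slot, {"c1": 6, "c2": 6}))
```

In this model the only extra delay occurs when a sporadic input is missing and some channel has not yet advanced, which matches the special case described above.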
Fix START-STOP-START for good
- All known issues on DPL side fixed, now other systems need to fix up their parts:
  - MCH / ITS / EMC crash in QC postprocessing, Barth / Piotr are following this up.
  - TPC Track QC is crashing, but unrelated - can always happen when the run is short. Robert is checking.
  - Readout sends wrong STFIDs
Problem with QC topologies with expendable tasks - Fixed in DPL, waiting for feedback.
New issue: sometimes the CCDB populator produces backpressure without processing data. This has already crashed several Pb-Pb runs: https://alice.its.cern.ch/jira/browse/O2-4244
- Disappeared after disabling the CPV gain calib, which was very slow. However, this may only have hidden the problem. Apparently there is a race condition that can trigger a problem in the input handling, which makes the CCDB populator get stuck. Since the run function of the CCDB populator is not called and it does not have a special completion policy, but simply consumeWhenAny, this is likely to be a generic problem.
- Cannot be debugged during Pb-Pb right now, since it is mitigated. But must be understood afterwards.

Other framework tickets:

TOF problem with receiving condition in tof-compressor: https://alice.its.cern.ch/jira/browse/O2-3681
Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
Backpressure reporting when there is only 1 input channel: no progress: https://alice.its.cern.ch/jira/browse/O2-4237
Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
https://alice.its.cern.ch/jira/browse/O2-1900 : Fix in PR, but it has side effects which must also be fixed.
https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
Support in DPL GUI to send individual START and STOP commands.
Problem I mentioned last time with non-critical QC tasks and DPL CCDB fetcher is real. Will need some extra work to solve it. Otherwise non-critical QC tasks will stall the DPL chain when they fail.
DPL sending SHM metrics for all devices, not only input proxy: https://alice.its.cern.ch/jira/browse/O2-4234
Some improvements to ease debugging: https://alice.its.cern.ch/jira/browse/O2-4196 https://alice.its.cern.ch/jira/browse/O2-4195 https://alice.its.cern.ch/jira/browse/O2-4166
After Pb-Pb, we need to do a cleanup session and go through all these pending DPL tickets with a higher priority, and finally try to clean up the backlog.

Global calibration topics:

TPC IDC and SAC workflow issues to be reevaluated with new O2 at restart of data taking. Cannot reproduce the problems any more.

Async reconstruction

Remaining oscillation problem: GPUs sometimes get stalled for a long time, up to 2 minutes.
- Checking 2 things: does the situation get better without GPU monitoring? --> Inconclusive
- We can use an increased priority for the GPU processes as a mitigation, but it doesn't fully fix the issue.
Chiara reported again lower performance on both MI50 and MI100 EPNs in async reco, needs to be investigated.
- Discussed with Chiara how I can reproduce it, but didn't check yet.
Async reco performance:
- Work in progress - already a significant speed up but not enough.
- TOF matching now supports multi-threading, which should remove it from the critical path of the latency.

EPN major topics:

Fast movement of nodes between async / online without EPN expert intervention.
- 2 goals I would like to set for the final solution:
  - It should not be needed to stop the SLURM schedulers when moving nodes, there should be no limitation for ongoing runs at P2 and ongoing async jobs.
  - We must not lose which nodes are marked as bad while moving.
Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
- Lubos to provide interface to query current EPN SHM settings - ETA July 2023, Status?
Improve DataDistribution file replay performance, currently cannot do faster than 0.8 Hz, cannot test MI100 EPN in Pb-Pb at nominal rate, and cannot test pp workflow for 100 EPNs in FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
DataDistribution distributes data round-robin in the absence of backpressure, but it would be better to do it based on buffer utilization, and give more data to MI100 nodes. Now, we are driving the MI50 nodes at 100% capacity with backpressure, and then only backpressured TFs go to the MI100 nodes. This increases the memory pressure on the MI50 nodes, which is a critical point anyway. https://alice.its.cern.ch/jira/browse/EPN-397
TfBuilders should stop in ERROR when they lose connection.
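
Illustrative sketch for the DataDistribution scheduling item above (EPN-397), contrasting round-robin with a choice based on buffer utilization. This is not DataDistribution code: the Node type, the function names, and the buffer numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Toy EPN model: only the TF buffer fill level matters here."""
    name: str
    buffer_size: int       # total buffer (arbitrary units, assumed)
    buffer_used: int = 0   # currently occupied buffer

    @property
    def utilization(self) -> float:
        return self.buffer_used / self.buffer_size


def pick_round_robin(nodes: List[Node], tf_index: int) -> Node:
    """Current behavior without backpressure: ignore load, cycle through nodes."""
    return nodes[tf_index % len(nodes)]


def pick_least_utilized(nodes: List[Node]) -> Node:
    """Alternative: send the next TF to the node with the emptiest buffer."""
    return min(nodes, key=lambda n: n.utilization)


nodes = [Node("mi50-a", 100, buffer_used=90),
         Node("mi50-b", 100, buffer_used=85),
         Node("mi100-a", 100, buffer_used=30)]

# Round-robin still hands TF 0 to the nearly full MI50 node ...
print(pick_round_robin(nodes, 0).name)
# ... while the utilization-based choice prefers the faster-draining MI100 node.
print(pick_least_utilized(nodes).name)
```

Since faster nodes drain their buffers sooner, a utilization-based choice would automatically give them more TFs without needing to know the GPU type.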

Other EPN topics:

Check NUMA balancing after SHM allocation, sometimes nodes are unbalanced and slow: https://alice.its.cern.ch/jira/browse/EPN-245
Fix problem with SetProperties string > 1024/1536 bytes: https://alice.its.cern.ch/jira/browse/EPN-134 and https://github.com/FairRootGroup/DDS/issues/440
After software installation, check whether it succeeded on all online nodes (https://alice.its.cern.ch/jira/browse/EPN-155) and consolidate software deployment scripts in general.
Improve InfoLogger messages when environment creation fails due to too few EPNs / calib nodes available, ideally report a proper error directly in the ECS GUI: https://alice.its.cern.ch/jira/browse/EPN-65
Create user for epn2eos experts for debugging: https://alice.its.cern.ch/jira/browse/EPN-383
EPNs sometimes get in a bad state, with CPU stuck, probably due to AMD driver. To be investigated and reported to AMD.

Raw decoding checks:

Add an additional check at DPL level to make sure the firstOrbit received from all detectors is identical when creating the TimeFrame first orbit.
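
Illustrative sketch of such a consistency check in plain Python. This is not DPL code: the function name and the detector-to-orbit mapping are hypothetical stand-ins for whatever the framework extracts from the raw data headers.

```python
from typing import Dict


def common_first_orbit(first_orbits: Dict[str, int]) -> int:
    """Return the firstOrbit shared by all detectors, or raise if they disagree."""
    if not first_orbits:
        raise ValueError("no detector reported a firstOrbit")
    distinct = set(first_orbits.values())
    if len(distinct) != 1:
        # Report every detector so that the outlier is easy to spot in the log.
        details = ", ".join(f"{det}={orbit}" for det, orbit in sorted(first_orbits.items()))
        raise ValueError(f"inconsistent firstOrbit across detectors: {details}")
    return distinct.pop()


print(common_first_orbit({"TPC": 1024, "ITS": 1024, "TOF": 1024}))

try:
    common_first_orbit({"TPC": 1024, "ITS": 1056})
except ValueError as err:
    print(err)
```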

Full system test issues:

Topology generation:

Should test deploying the topology with the DPL driver, to have the remote GUI available. Status?

QC / Monitoring / InfoLogger updates:

TPC has opened a first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and the plan is to extend this to all detectors and also to include trending for raw data sizes.

AliECS related topics:

Extra env var field still not multi-line by default.

FairMQ issues:

Problems with FairMQ solved.
- EPNs and TPC need to find a good size of the refCount region for TPC laser runs (default setting is too small, but I also don't want to waste memory). We'll have a joint test session later today.
- There is a problem that the refCount segment of the SHM tool gets deleted and DD recreates its on file afterwards, which works, but only by chance. Should probably be followed up by EPN and FMQ experts.

High priority RC YETS issues:

All crashes of code under PDP responsibility fixed. Remaining crashes: 1 in QC, 2 in DataDistribution
Make Start / Stop / Start work: All known framework issues fixed. Remaining problems: 1 in Readout, 2 in QC
Fix dropping lifetime::timeframe for good: 2 debug features merged, and several issues in user code fixed, that were revealed by it. ETA for third feature: end of next week.
Fix problem with input-proxy not dropping data: fixed. For problem with network buffer limits in FMQ, we discussed possible solutions in https://alice.its.cern.ch/jira/browse/O2-4414. Unfortunately, with current FMQ / ZMQ, there is no good solution. QC could try to use REQ/REP instead of PUB/SUB. To be tested.
Expendable tasks in QC: PRs by Giulio to O2 and QC. Everything merged now? So we can test next week.
Stabilize calibration / fix EoS: We have a plan how to implement it. Will take some time, but hopefully before restart of data taking.
Fix problem with ccdb-populator: no idea yet, no ETA.
Added a new issue: if one EPN is slow during node allocation, that kills the whole run, even if nmin is fulfilled. Happened 3 times during the tests on Tuesday. Opened https://alice.its.cern.ch/jira/browse/EPN-432

GPU ROCm / compiler topics:

Found a new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
Found a new miscompilation with -ffast-math enabled in looper following; -ffast-math is disabled for now.
Must create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
Ruben found another compiler problem with template treatment. Have a workaround for now. Need to create a minimal reproducer and file a bug report.
While debugging the calibration, debug output triggered another internal compiler error in the HIP compiler. No problem for now since it happened only with temporary debug code. But we should still report it to AMD so that it gets fixed.
New compiler regression in ROCm 5.6, need to create testcase and send to AMD.
ROCm 5.7.1 not working, waiting for AMD to reply.

TPC GPU Processing

Bug in TPC QC with MC embedding, TPC QC does not respect sourceID of MC labels, so confuses tracks of signal and of background events.
New problem with bogus values in TPC fast transformation map still pending. Sergey is investigating, but waiting for input from Alex.
TPC has provided the dead channel map as calib object. Next step now is to respect it during tracking, and not to abort tracking if no hits are found when the channels are dead.

Issues currently lacking manpower, waiting for a volunteer:

For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
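
Illustrative sketch of a parameter scan with separate training and test time frames, in plain Python. This is not the existing tuning script: the scan function, the parameter name, and the measure callable (which would wrap an actual timed GPU run) are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, int]


def scan(param_grid: Dict[str, Sequence[int]],
         measure: Callable[[Params, int], float],
         train_tfs: List[int],
         test_tfs: List[int]) -> Tuple[Params, float, float]:
    """Pick the parameters with the lowest mean time on the training TFs,
    then report the mean time of that choice on the unseen test TFs."""
    names = sorted(param_grid)
    best: Params = {}
    best_time = float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        mean_time = sum(measure(params, tf) for tf in train_tfs) / len(train_tfs)
        if mean_time < best_time:
            best, best_time = params, mean_time
    test_time = sum(measure(best, tf) for tf in test_tfs) / len(test_tfs)
    return best, best_time, test_time


# Stand-in for a real timing measurement, purely for demonstration.
def fake_measure(params: Params, tf: int) -> float:
    return abs(params["block_size"] - 256) + tf


best, train_time, test_time = scan({"block_size": [128, 256, 512]},
                                   fake_measure, train_tfs=[1, 3], test_tfs=[2])
print(best, train_time, test_time)
```

Comparing the training and test times shows whether the chosen parameters generalize or were merely tuned to the particular time frames used in the scan.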

2

TRD Tracking

Speaker: Ole Schmidt (CERN)

3

TPC ML Clustering

Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))

Ongoing activities

Looper tagging & pad-parallel tagging
Retrain NNs to exclude loopers
pT & η differential studies

1. Looper tagging

Both looper tagging and pad-parallel tagging work
Same algorithm, different "cuts" -> In process of optimization (smallest possible rejected volume with maximum possible fake reduction)
Total rejected TPC volume ~15-16%

Weakness for larger loopers -> needs fixing with applied cuts

2. p_T & η differential studies

Checked performance with and without looper tagging
Results make sense and show that the network gets more "confused" with loopers

3. Retraining NNs

Next step in the pipeline
Need to check that training data is read out correctly (some changes were necessary in the QA task)

4. Visualization & PyTorch

Found a nice GitHub repo with some LaTeX code for plotting neural networks

PyTorch forums: Opened a thread on Conv3D not utilizing tensor cores - https://discuss.pytorch.org/t/conv3d-tensor-core-utilisation/192389

Minutes from private discussion

1. Looper tagger

Only use pad-parallel tagging
Only mask regions around individual clusters, not full regions of investigation
Re-branding of looper tagger -> "Exclusion of identified regions from efficiency calculation"
Re-branding of pad-parallel tracks -> tracks with high inclination angle

2. Cluster writing

Create own tpc-native-clusters.root to run tracking and create tracking QA once running

3. Neural network

Try with only fully connected layers

4

ITS Tracking

Speaker: Matteo Concas (CERN)
