Alice Weekly Meeting: Software for Hardware Accelerators / PDP-SRC
11:00 → 11:20
Discussion 20m - Speakers: David Rohr (CERN), Ole Schmidt (CERN)
Color code: (critical, news during the meeting: green, news from this week: blue, news from last week: purple, no news: black)
High priority framework topics:
- Problem with EndOfStream / Dropping Lifetime::Timeframe data
- Together with Giulio, we debugged and fixed plenty of bugs, and apparently these 2 major topics had partially the same root causes:
- NewRun detection in input proxy was broken.
- 0xCCDB message was sent with incorrect runNumber / startTime.
- DPL was sending 0xCCDB message also when there was no output.
- With output detection added, devices that do not use DPL to send messages (input proxy, raw tf reader, raw file reader) were not triggering the 0xCCDB message, since DPL was not aware of the output. Changed them to notify DPL that there was an output.
- EndOfStream counter in input proxy was only taking into account one channel.
- Bogus messages with runNumber 0 caused the input proxy to send EoS immediately (which we received due to the problem with the runNumber in the 0xCCDB message).
- EoS counter in input proxy did not work correctly if there was no data received before the EoS.
- With all these fixes, the majority of the error flood in staging and in SYNTHETIC runs on production is gone, but we are still not done:
- The EndOfStream mechanism should be fully working now, but by design it cannot handle two cases: an EPN fails while its FMQ channel is not yet closed, so the input proxy waits for EoS indefinitely; or the processing after EoS exceeds the 20 s grace period of DPL.
- Both these issues can only be fixed once we add an additional state to the state machine (draft still work in progress).
- We still see some "Dropping lifetime::timeframe" messages in SYNTHETIC / COSMIC runs. I believe these are now real errors in user code, which fails to send output in some cases. Asked Giulio to add sanity checks in DPL for such cases, otherwise it is extremely difficult to identify the root cause: https://alice.its.cern.ch/jira/browse/O2-4284 https://alice.its.cern.ch/jira/browse/O2-4283
- In async reco tests on the EPNs, I have seen such messages as well, and this led to up to 2% of TFs not reaching the AOD writer: https://alice.its.cern.ch/jira/browse/O2-4291
- Chiara checked that this apparently doesn't happen on the GRID; it must be timing related.
- Will add check for this to the async reco validation scripts.
- Must be investigated further.
- Latest update on oldestPossibleTimeframe / dropping lifetime::timeframe:
- Most errors online are gone, InfoLogger much cleaner than before.
- This severely affects async reco on the EPNs, majority of jobs go to error.
- Both the current async tag and David's test on the EPN were done with O2 versions that have known problems, which have meanwhile been fixed. Need to recompile with O2/dev on the EPNs and retry.
- Still issues in online runs, asked Giulio to implement 3 checks (opened 3 JIRA tickets).
- 1 improved error message implemented and merged, 1 in a PR, 1 still in development.
- Fix START-STOP-START for good
- https://github.com/AliceO2Group/AliceO2/pull/9895 still not merged due to a conflict.
- Functionality to reset counters at STOP added.
- Unfortunately not fully working, and counters still out of sync after restart. Currently being debugged by Giulio and David.
- Problem with QC topologies with expendable tasks. For items to do see: https://alice.its.cern.ch/jira/browse/QC-953 - Status?
- Problem in QC where we collect messages in memory while the run is stopped: https://alice.its.cern.ch/jira/browse/O2-3691
- Tested with the latest fixes, but it was not working. New fixes merged in O2. Can test next week after the next software update.
- Switch 0xdeadbeef handling from creating dummy messages on the fly for optional inputs to injecting them at the readout-proxy level.
- Done
- This was needed more urgently for CTP QC, and RC said waiting for the detectors to change their workflows might take long, so we changed all workflows centrally.
- Updated documentation, and asked detectors to double-check.
- New issue: sometimes the CCDB populator produces backpressure without processing data. It has crashed several Pb-Pb runs so far: https://alice.its.cern.ch/jira/browse/O2-4244
- Disappeared after disabling the CPV gain calibration, which was very slow. However, this can only have hidden the problem. Apparently there is a race condition that can trigger a problem in the input handling, which makes the CCDB populator get stuck. Since the run function of the CCDB populator is not called, and it does not have a special completion policy but simply consumeWhenAny, this is likely a generic problem.
- Cannot be debugged during Pb-Pb right now, since it is mitigated, but it must be understood afterwards.
Other framework tickets:
- TOF problem with receiving condition in tof-compressor: https://alice.its.cern.ch/jira/browse/O2-3681
- Grafana metrics: Might want to introduce additional rate metrics that subtract the header overhead to have the pure payload: low priority.
- Backpressure reporting when there is only 1 input channel: no progress: https://alice.its.cern.ch/jira/browse/O2-4237
- Stop entire workflow if one process segfaults / exits unexpectedly. Tested again in January, still not working despite some fixes. https://alice.its.cern.ch/jira/browse/O2-2710
- https://alice.its.cern.ch/jira/browse/O2-1900 : FIX in PR, but has side effects which must also be fixed.
- https://alice.its.cern.ch/jira/browse/O2-2213 : Cannot override debug severity for tpc-tracker
- https://alice.its.cern.ch/jira/browse/O2-2209 : Improve DebugGUI information
- https://alice.its.cern.ch/jira/browse/O2-2140 : Better error message (or a message at all) when input missing
- https://alice.its.cern.ch/jira/browse/O2-2361 : Problem with 2 devices of the same name
- https://alice.its.cern.ch/jira/browse/O2-2300 : Usage of valgrind in external terminal: The testcase is currently causing a segfault, which is an unrelated problem and must be fixed first. Reproduced and investigated by Giulio.
- Found a reproducible crash (while fixing the memory leak) in the TOF compressed-decoder at workflow termination, if the wrong topology is running. Not critical, since it is only at the termination, and the fix of the topology avoids it in any case. But we should still understand and fix the crash itself. A reproducer is available.
- Support in DPL GUI to send individual START and STOP commands.
- The problem I mentioned last time with non-critical QC tasks and the DPL CCDB fetcher is real. It will need some extra work to solve; otherwise non-critical QC tasks will stall the DPL chain when they fail.
- DPL sending SHM metrics for all devices, not only input proxy: https://alice.its.cern.ch/jira/browse/O2-4234
- Some improvements to ease debugging: https://alice.its.cern.ch/jira/browse/O2-4196 https://alice.its.cern.ch/jira/browse/O2-4195 https://alice.its.cern.ch/jira/browse/O2-4166
- After Pb-Pb, we need to do a cleanup session and go through all these pending DPL tickets with a higher priority, and finally try to clean up the backlog.
Global calibration topics:
- TPC IDC workflow problem.
- TPC has issues with SAC workflow. Need to understand if this is the known long-standing DPL issue with "Dropping lifetime::timeframe" or something else.
- Even with the latest changes, it is difficult to guarantee calibration finalization at the end of a global run (as discussed with Ruben yesterday).
- Jira with 2 proposals for new EoS scheme: https://alice.its.cern.ch/jira/browse/O2-4308
- After discussion with Giulio, Ruben, and David, we decided to implement proposal 2.
- Problem with endOfStream in the middle of a run, stopping calib processing: fixed.
CCDB:
- Bump to libwebsockets 4.x / JAliEn-ROOT 0.7.4: Done
Sync reconstruction / Software at P2:
- Back to regular updates every Monday
CTF Size:
- New CTF coding scheme deployed and working.
Async reconstruction
- Remaining oscillation problem: GPUs sometimes get stalled for a long time, up to 2 minutes.
- Checking 2 things: does the situation get better without GPU monitoring? --> Inconclusive
- We can use increased GPU process priority as a mitigation, but it doesn't fully fix the issue.
- Chiara reported again lower performance on both MI50 and MI100 EPNs in async reco, needs to be investigated.
- Async reco performance:
- Significant improvements in ITS tracking (both speed and memory)
- Secondary vertexing sped up using mass hypothesis
- Still very slow, and GPU mostly idling on EPNs
EPN major topics:
- Fast movement of nodes between async / online without EPN expert intervention.
- 2 goals I would like to set for the final solution:
- It should not be needed to stop the SLURM schedulers when moving nodes, there should be no limitation for ongoing runs at P2 and ongoing async jobs.
- We must not lose the information about which nodes are marked as bad while moving.
- Interface to change SHM memory sizes when no run is ongoing. Otherwise we cannot tune the workflow for both Pb-Pb and pp: https://alice.its.cern.ch/jira/browse/EPN-250
- Lubos to provide interface to query current EPN SHM settings - ETA July 2023, Status?
- Improve DataDistribution file replay performance; currently cannot replay faster than 0.8 Hz, so we cannot test the MI100 EPN in Pb-Pb at nominal rate, and cannot test the pp workflow for 100 EPNs in the FST since DD injects TFs too slowly. https://alice.its.cern.ch/jira/browse/EPN-244 NO ETA
- DataDistribution distributes data round-robin in the absence of backpressure, but it would be better to do it based on buffer utilization and give more data to the MI100 nodes. Now we are driving the MI50 nodes at 100% capacity with backpressure, and then only backpressured TFs go to the MI100 nodes. This increases the memory pressure on the MI50 nodes, which is anyway a critical point (see the illustrative sketch after this list). https://alice.its.cern.ch/jira/browse/EPN-397
- TfBuilders should stop in ERROR when they lose connection.
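- A purely illustrative sketch of the buffer-utilization-based distribution mentioned above (not the actual DataDistribution code; the TfBuilder fields used here are hypothetical):
      # Illustrative only: pick the TfBuilder with the lowest relative buffer usage,
      # instead of cycling round-robin irrespective of how full the buffers are.
      def pick_target(tf_builders):
          return min(tf_builders, key=lambda b: b.buffer_used / b.buffer_size)
      # Round-robin drives the MI50 nodes into backpressure first; utilization-based
      # selection would naturally send more TFs to the emptier MI100 buffers.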
Other EPN topics:
- Check NUMA balancing after SHM allocation, sometimes nodes are unbalanced and slow: https://alice.its.cern.ch/jira/browse/EPN-245
- Fix problem with SetProperties string > 1024/1536 bytes: https://alice.its.cern.ch/jira/browse/EPN-134 and https://github.com/FairRootGroup/DDS/issues/440
- After software installation, check whether it succeeded on all online nodes (https://alice.its.cern.ch/jira/browse/EPN-155) and consolidate software deployment scripts in general.
- Improve InfoLogger messages when environment creation fails due to too few EPNs / calib nodes available, ideally report a proper error directly in the ECS GUI: https://alice.its.cern.ch/jira/browse/EPN-65
- Create user for epn2eos experts for debugging: https://alice.its.cern.ch/jira/browse/EPN-383
- EPNs sometimes get in a bad state, with CPU stuck, probably due to AMD driver. To be investigated and reported to AMD.
TPC Raw decoding checks:
- Add additional check on DPL level, to make sure firstOrbit received from all detectors is identical, when creating the TimeFrame first orbit.
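- A minimal sketch of the intended check (illustrative logic only, not actual DPL code; the input structure is assumed):
      # Illustrative only: verify that all detectors report the same firstOrbit
      # before it is used as the TimeFrame first orbit.
      def check_first_orbit(first_orbits):            # e.g. {"TPC": 123, "ITS": 123, ...}
          values = set(first_orbits.values())
          if len(values) != 1:
              raise RuntimeError(f"Inconsistent firstOrbit across detectors: {first_orbits}")
          return values.pop()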
Full system test issues:
Topology generation:
- Should test deploying the topology with the DPL driver, to have the remote GUI available. Status?
QC / Monitoring / InfoLogger updates:
- TPC has opened first PR for monitoring of cluster rejection in QC. Trending for TPC CTFs is work in progress. Ole will join from our side, and plan is to extend this to all detectors, and to include also trending for raw data sizes.
AliECS related topics:
- Extra env var field still not multi-line by default.
FairMQ issues:
- After the bump to FMQ 1.8.1 we have 2 issues; currently investigating these with Alexey:
- FST crashes (most likely since FMQ moved ref counts to a separate region, which is probably too small and of fixed size. Will check with larger region. FMQ should increase the default size, and improve the error message).
- Crashes online, probably since EPN did not update the shm-tool. Will try to do a new build today with the region size fixed, deploy it on staging, and EPN will update the shm tool on staging.
High priority RC YETS issues:
- Fix crashes at shutdown: All crashes on PDP side fixed. Crashes of input proxy required fix in OCC plugin and FMQ (FMQ fixed in 1.8.1, but cannot use it yet). 2 remaining issues in DataDistribution. Opened JIRA tickets, and assigned to EPN.
- Make Start / Stop / Start work: Tests yesterday revealed counters still out-of-sync. Work in progress.
- Fix dropping lifetime::timeframe for good: Work in progress, waiting for the 3 debug features in DPL.
- Fix CTP QC / allow FIT to send non-raw data from FLPs / update to new 0xDEADBEEF mechanism: Done
- Fix problem with input-proxy not dropping data: Fixes deployed with this Monday's update, but they are incomplete. New fixes are available; we wanted to deploy them today, but this was cancelled due to the FMQ issues.
- Expendable tasks in QC: Waiting for Giulio and Barth to investigate the current problem. In principle all is ready from both sides, but it requires fixes.
- Stabilize calibration / fix EoS: We have a plan for how to implement it. It will take some time, but hopefully before the restart of data taking.
- Fix problem with ccdb-populator: no idea yet, no ETA.
GPU ROCm / compiler topics:
- Found new HIP internal compiler error when compiling without optimization: -O0 makes the compilation fail with an unsupported LLVM intrinsic. Reported to AMD.
- Found a new miscompilation with -ffast-math enabled in looper following; for now we disabled -ffast-math.
- Must create new minimal reproducer for compile error when we enable LOG(...) functionality in the HIP code. Check whether this is a bug in our code or in ROCm. Lubos will work on this.
- Another compiler problem with template treatment was found by Ruben. We have a workaround for now. Need to create a minimal reproducer and file a bug report.
- While debugging the calibration, debug output triggered another internal compiler error in the HIP compiler. No problem for now since it happened only with temporary debug code, but we should still report it to AMD so they can fix it.
- New compiler regression in ROCm 5.6, need to create testcase and send to AMD.
- AMD provided custom ROCm 5.7.1:
- They did not manage to fix the compiler issue yet; instead the new ROCm has an environment variable option that restores the old behavior for register spilling, which is not broken for us. The final fix is still in development.
- Unfortunately, there is a new regression in ROCm 5.7.1, and we cannot use it. Created a reproducer and sent a bug report to AMD.
- Didn't check yet whether any of the other pending ROCm issues are fixed with 5.7.1.
TPC GPU Processing
- Bug in TPC QC with MC embedding: TPC QC does not respect the sourceID of MC labels, so it confuses tracks of signal and of background events (a small illustrative sketch follows after this list).
- Online runs at low IR / low energy show weird clusters-per-track statistics.
- The problem was due to an incorrect vDrift, though it is not clear why this breaks tracking so badly; being investigated.
- vDrift was off by so much that many tracks were cut away by the eta cut, and tracks became longer than 250 cm in z, so track following was aborted.
- Ruben reported an issue with the global track refit, which sometimes does not produce the TPC track fit results.
- All issues are either fixed, or understood and originating from differences in the original track parameters when the fit starts.
- New problem with bogus values in TPC fast transformation map still pending. Sergey is investigating, but waiting for input from Alex.
- Fixed a bug where GPU Standalone Multi-Threading would segfault when shutting down the environment, since the asynchronous thread was still accessing the FMQ device before stopping.
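- Regarding the TPC QC MC-embedding bug above, a minimal sketch of the idea of a fix (illustrative only, assuming labels carry sourceID / eventID / trackID fields; not the actual QC code):
      # Illustrative only: key MC labels by the full (sourceID, eventID, trackID) tuple,
      # so signal and embedded-background tracks with the same trackID are not mixed up.
      def label_key(label):
          return (label.sourceID, label.eventID, label.trackID)   # assumed label fields
      # The buggy behavior effectively keys by label.trackID alone, which collides
      # between signal and background events in embedded MC.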
Issues currently lacking manpower, waiting for a volunteer:
- For debugging, it would be convenient to have a proper tool that (using FairMQ debug mode) can list all messages currently in the SHM segments, similarly to what I had hacked together for https://alice.its.cern.ch/jira/browse/O2-2108
- Redo / Improve the parameter range scan for tuning GPU parameters. In particular, on the AMD GPUs, since they seem to be affected much more by memory sizes, we have to use test time frames of the correct size, and we have to separate training and test data sets.
11:20 → 11:25
TRD Tracking 5m - Speaker: Ole Schmidt (CERN)
11:25 → 11:30
TPC ML Clustering 5m - Speaker: Christian Sonnabend (CERN, Heidelberg University (DE))
Recent projects:
- Check neural network performance in 3D
- Check neural network speed on GPU
- Implement looper tagger
- Extract track parameters for each cluster and check performance e.g. pT differential
1. Neural network performance in 3D
- Testing 7x7x7 input, boundary implemented between IROC and OROC1 (due to different pad sizes)
- Comparison of classification with the 2D case, with variation of the threshold
- 3D network has clear potential to distinguish between fake and real clusters
- Fake rate falls more steeply than efficiency -> a slight compromise in efficiency can lead to a strong reduction in the fake rate
2. Check neural network speed on GPU
- Goal for Run 4: processing speed of 50 million clusters / s
- With improvements for GPUs and parallel processing, now reaching ~10-20 million clusters / s
- Neural network classification: Float16 should be good enough, training scripts implemented & working
- With some model optimizations (e.g. matrix dimensions in multiples of 4 or 8 for tensor-core optimizations): reaching ~28 million clusters / s (fp16) and ~22 million clusters / s (fp32) on an NVIDIA Tesla V100 PCIe (aliceml server); a minimal sketch of such optimizations follows below
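- A minimal sketch of such optimizations, assuming a PyTorch model (layer sizes and names are illustrative, not the actual network):
      import torch
      import torch.nn as nn

      def round_up(n, multiple=8):
          # pad layer widths to a multiple of 8 so fp16 matrix multiplies map well onto tensor cores
          return ((n + multiple - 1) // multiple) * multiple

      hidden = round_up(50)          # e.g. 50 -> 56
      model = nn.Sequential(nn.Linear(7 * 7, hidden), nn.ReLU(), nn.Linear(hidden, 1))
      model = model.half().cuda()    # fp16 weights, run on the GPU
      with torch.no_grad():
          out = model(torch.randn(1024, 7 * 7, device="cuda", dtype=torch.float16))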
3. Implementation of looper tagger
- Create vector of size (max(time) / granularity, row, pad)
- Accumulate all MC labels within "granularity" into this vector
- Check (per row, sliding a pad/time window; cleaned-up pseudocode):
      from collections import Counter
      for t in range(0, max_time, time_window):
          for p in range(0, n_pads, pad_window):
              labels = labels_in_window(label_map, row, t, p)          # helper: all MC labels in this window (illustrative)
              if any(c > n for c in Counter(labels).values()):         # some label occurs more than n times
                  charge_map[row][p][t : t + time_window] = LOOPER_TAG # tag region as looper
- Need to add check for charge decrease -> low momentum particles can have maxima in ion tail
- Settings: granularity = 10, time-window: 50, pad-window: 5, threshold (n): 4
- Before: 94.0% efficiency, 19.2% fakes; After: 94.1% efficiency, 17.4% fakes
4. Extract track parameters for each cluster and check performance e.g. pT differential
- Tracks extracted from collisioncontext.root (thanks to a macro from Sandro)
- Via the MC label (trackID) we can associate e.g. the pT of that track with the cluster, since we also have the MC label of the cluster (see the sketch after this list)
- Implemented; this now leads to further investigation of high-pT tracks (that's where clusterization has to work)
- Sample training data according to physics objectives
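- A minimal sketch of the pT association described above (illustrative field names; not the actual macro):
      # Illustrative only: map each MC trackID to its pT, then attach that pT to every
      # cluster carrying the same MC label, enabling pT-differential checks and sampling.
      track_pt = {t["trackID"]: t["pT"] for t in tracks}        # tracks from collisioncontext.root
      for cluster in clusters:
          cluster["pT"] = track_pt.get(cluster["mc_trackID"])   # None if no matching track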