Scaling in traccc with vecmem v1.23 and CUDA event caching
Scaling is much better as-is.
[Figure: scaling results on L40]
[Figure: scaling results on H100]
- `cudaLaunchHostFunc` appears to be costly when used for suspension synchronization; CUDA events were investigated as a replacement.
- Delegation does help mitigate the suspension cost, but TBB seems to hang with some combinations of thread and slot counts. Alternative to investigate: delegate to a bare thread instead of a single-thread TBB arena.
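One way to use events instead of a host callback is to record an event after the GPU work, poll it, and resume the waiting task once it has completed. The sketch below is a minimal, hypothetical C++ version of that polling step: `CompletionPoller`, `add` and `poll` are illustrative names, not traccc or vecmem API, and each `is_done` predicate stands in for `cudaEventQuery(event) == cudaSuccess` so the example runs without a GPU.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: polls completion predicates instead of relying on a
// host callback (such as one enqueued with cudaLaunchHostFunc).
class CompletionPoller {
public:
    // `is_done` would typically wrap cudaEventQuery(event) == cudaSuccess;
    // `on_done` would resume the task that was suspended on that event.
    void add(std::function<bool()> is_done, std::function<void()> on_done) {
        pending_.push_back({std::move(is_done), std::move(on_done)});
    }

    // Checks every pending entry once, runs the continuation of each entry
    // that has completed, and returns how many entries are pending afterwards.
    std::size_t poll() {
        std::vector<Entry> still_pending;
        std::vector<Entry> ready;
        for (auto& entry : pending_) {
            if (entry.is_done()) {
                ready.push_back(std::move(entry));
            } else {
                still_pending.push_back(std::move(entry));
            }
        }
        pending_ = std::move(still_pending);
        // Continuations run after pending_ is updated, so they may call add().
        for (auto& entry : ready) {
            entry.on_done();
        }
        return pending_.size();
    }

private:
    struct Entry {
        std::function<bool()> is_done;
        std::function<void()> on_done;
    };
    std::vector<Entry> pending_;
};
```

How often `poll()` runs, and on which thread, is a separate design choice that this sketch leaves open.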