Framework
- ONNX: Non-deterministic compute on GPU
- More global issue: tried all variations, but even in float32, using
(mPImplOrt->sessionOptions).AddConfigEntry("session_options.use_deterministic_compute", "1");
the compute was still not deterministic.
- Investigation in GitHub threads and forums shows that this is indeed not guaranteed by ONNX (or any other framework), not even during training!
- SetDeterministicCompute: "... If set to true, this will enable deterministic compute for GPU kernels where possible. ..."
- Variations (using only the classification network with fixed threshold):
- L2, N16: 276782342 ± 105 → ~3.8×10^-5 % relative deviation
- L5, N128: 265601210 ± 111 → ~4.2×10^-5 % relative deviation
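The residual deviations above are consistent with floating-point reductions being reordered between runs (e.g. by GPU atomics). A minimal pure-Python sketch of the effect, not ONNX-specific, just illustrating that the same numbers summed in different orders give different results:

```python
import random

# Floating-point addition is not associative, so any backend that changes
# the reduction order between runs can return slightly different sums.
random.seed(42)
values = [random.uniform(-1.0, 1.0) for _ in range(100000)]

sums = set()
for trial in range(10):
    shuffled = values[:]                 # same numbers, different order
    random.Random(trial).shuffle(shuffled)
    sums.add(sum(shuffled))

# More than one distinct result despite identical inputs -> order matters.
print(len(sums) > 1)
```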
GPU timing test
(see debugLevel=0 folder) Classification network + CF regression
-> Larger batch size is better, but also: almost no difference in wall time between L2_N16 and L5_N32 for almost any batch size -> L5_N32 will be the network of choice
(see debugLevel=1 folder) Classification network + CF regression
-> Larger batch size is better. The step from L2_N16 to L5_N32 costs a factor of 1.4, but the largest jumps are typically noticeable going from 32 to 64 neurons per layer (potentially connected to cross-warp computations?)
Memory consumption increases with larger batch size: Input = BS × sizeof(float16) × (e.g. 3×9×9 + 3) × #streams + ONNX overhead. For a batch size of 262144 and 3 streams: Input ~387 MB + ONNX
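The input-buffer estimate above can be checked with a short sketch (assumptions: 2-byte float16 values, a 3×9×9 charge window plus 3 extra features per cluster, as in the formula):

```python
# Sketch of the input-buffer estimate (excludes ONNX Runtime's own allocations).
BYTES_PER_FLOAT16 = 2
FEATURES = 3 * 9 * 9 + 3  # 246 input values per candidate

def input_buffer_mb(batch_size, n_streams):
    """Input memory in MB for the given batch size and number of streams."""
    return batch_size * BYTES_PER_FLOAT16 * FEATURES * n_streams / 1e6

print(round(input_buffer_mb(262144, 3), 1))  # ~386.9 MB, matching the ~387 MB above
```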
Physics
- Redid the QA on the ideal clusters, now compared to ~1.5 years ago
- Changes in between:
- Added looper tagging
- Changed peak finder to the peak finder of the GPU cluster finder (for training data generation)
- Using native clusters from reconstruction, not the clusters generated in the QA macro
- Using a simulation which enforces 0-5% centrality, so efficiencies go down a bit and fake rates go up a bit
- Network size = classification network size, Regression algorithm = GPU CF regression
Raw efficiency without loopers: how many clusters have been attached to at least one ideal (MC) cluster, excluding all (local) looper-tagged regions
- NN can never attach more clusters than GPU CF because it won't produce more clusters (it can only reject)


Another possibility: #(attached ideal clusters, not in looper-tagged region) / #(reconstructed clusters)
- Can be artificially inflated by the number of loopers, but in all cases the network should reject looping clusters better, so the efficiency should go up
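The two definitions can be written out side by side; all counts below are made up purely for illustration, not measured values:

```python
# Hypothetical counts (illustration only, not real QA numbers).
ideal_clusters_non_looper = 1000   # ideal (MC) clusters outside looper-tagged regions
attached_non_looper = 930          # of those, attached to >= 1 reconstructed cluster
reconstructed_clusters = 1100      # all reconstructed clusters

# Raw efficiency without loopers: normalize by the ideal clusters.
raw_efficiency = attached_non_looper / ideal_clusters_non_looper

# Alternative: normalize by the reconstructed clusters instead; rejecting
# loopers shrinks the denominator, so this ratio is sensitive to looper count.
alt_efficiency = attached_non_looper / reconstructed_clusters

print(raw_efficiency, alt_efficiency)
```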


To-Do
- Create standalone reproducer for ONNX non-determinism
- Create Efficiency-Fake plots with regression network