Framework
- ONNX: Non-deterministic compute on GPU
- More global issue: tried all variations, but even in float32, using
(mPImplOrt->sessionOptions).AddConfigEntry("session_options.use_deterministic_compute", "1");
the compute was still not deterministic.
- Investigation in GitHub threads and forums shows that this is indeed not guaranteed by ONNX (or any other framework), not even during training!
- SetDeterministicCompute: "... If set to true, this will enable deterministic compute for GPU kernels where possible. ..."
- Variations (using only the classification network with fixed threshold):
- L2, N16: 276782342 ± 105 → ~3.8×10^-5 % relative deviation
- L5, N128: 265601210 ± 111 → ~4.2×10^-5 % relative deviation
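The residual deviations above are consistent with floating-point reductions being reordered between runs (e.g. by GPU atomics). A minimal pure-Python sketch of the effect, not ONNX-specific, just illustrating that the same numbers summed in different orders give different results:

```python
import random

# Floating-point addition is not associative, so any backend that changes
# the reduction order between runs can return slightly different sums.
random.seed(42)
values = [random.uniform(-1.0, 1.0) for _ in range(100000)]

sums = set()
for trial in range(10):
    shuffled = values[:]                 # same numbers, different order
    random.Random(trial).shuffle(shuffled)
    sums.add(sum(shuffled))

# More than one distinct result despite identical inputs -> order matters.
print(len(sums) > 1)
```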
GPU timing test
(see debugLevel=0 folder) Classification network + CF regression
-> Larger batch size is better, but also: almost no difference in wall time between L2_N16 and L5_N32 for almost any batch size -> L5_N32 will be the network of choice
(see debugLevel=1 folder) Classification network + CF regression
-> Larger batch size is better. The step from L2_N16 to L5_N32 costs a factor of 1.4, but the largest jumps are typically noticeable going from 32 to 64 neurons per layer (potentially connected to cross-warp computations?)
Memory consumption increases with larger batch size: Input = BS × sizeof(float16) × (e.g. 3×9×9 + 3) × #streams + ONNX overhead. For a batch size of 262144 and 3 streams: Input ~387 MB + ONNX
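The input-buffer estimate above can be checked with a short sketch (assumptions: 2-byte float16 values, a 3×9×9 charge window plus 3 extra features per cluster, as in the formula):

```python
# Sketch of the input-buffer estimate (excludes ONNX Runtime's own allocations).
BYTES_PER_FLOAT16 = 2
FEATURES = 3 * 9 * 9 + 3  # 246 input values per candidate

def input_buffer_mb(batch_size, n_streams):
    """Input memory in MB for the given batch size and number of streams."""
    return batch_size * BYTES_PER_FLOAT16 * FEATURES * n_streams / 1e6

print(round(input_buffer_mb(262144, 3), 1))  # ~386.9 MB, matching the ~387 MB above
```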
Physics
- Redid the QA on the ideal clusters, now compared to ~1.5 years ago
- Changes in between:
- Added looper tagging
- Changed peak finder to the peak finder of the GPU cluster finder (for training data generation)
- Using native clusters from reconstruction, not the clusters generated in the QA macro
- Using a simulation which enforces 0-5% centrality, so efficiencies go down a bit and fake rates go up a bit
- Network size = classification network size, Regression algorithm = GPU CF regression
Raw efficiency without loopers: how many clusters have been attached to at least one ideal (MC) cluster, excluding all (local) looper-tagged regions
- NN can never attach more clusters than GPU CF because it won't produce more clusters (it can only reject)


Another possibility: #(attached ideal clusters, not in looper-tagged region) / #(reconstructed clusters)
- Can be artificially inflated by the number of loopers, but in all cases the network should reject looping clusters better, so the efficiency should go up
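The two definitions can be written out side by side; all counts below are made up purely for illustration, not measured values:

```python
# Hypothetical counts (illustration only, not real QA numbers).
ideal_clusters_non_looper = 1000   # ideal (MC) clusters outside looper-tagged regions
attached_non_looper = 930          # of those, attached to >= 1 reconstructed cluster
reconstructed_clusters = 1100      # all reconstructed clusters

# Raw efficiency without loopers: normalize by the ideal clusters.
raw_efficiency = attached_non_looper / ideal_clusters_non_looper

# Alternative: normalize by the reconstructed clusters instead; rejecting
# loopers shrinks the denominator, so this ratio is sensitive to looper count.
alt_efficiency = attached_non_looper / reconstructed_clusters

print(raw_efficiency, alt_efficiency)
```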


To-Do
- Create standalone reproducer for ONNX non-determinism
- Create Efficiency-Fake plots with regression network