Framework

 

GPU timing test

(see debugLevel=0 folder) Classification network + CF regression
-> Larger batch sizes are better, but also: almost no difference in wall time between L2_N16 and L5_N32 for almost any batch size -> L5_N32 will be the network of choice
 
(see debugLevel=1 folder) Classification network + CF regression
-> Larger batch sizes are better. The increase from L2_N16 to L5_N32 is a factor of ~1.4, but the largest jumps are typically noticeable when going from 32 to 64 neurons per layer (potentially connected to cross-warp computation?)
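A rough Python/onnxruntime sketch of how such a per-batch wall-time measurement could be reproduced; the model file name, input layout (3x9x9 + 3 float16 values per cluster, taken from the memory note below) and the CUDA provider are assumptions, not the framework's actual benchmark code:

import time
import numpy as np
import onnxruntime as ort

# Hypothetical model file; input shape assumed from the memory note below
sess = ort.InferenceSession("net_L5_N32.onnx", providers=["CUDAExecutionProvider"])
name = sess.get_inputs()[0].name
x = np.random.rand(262144, 3 * 9 * 9 + 3).astype(np.float16)

sess.run(None, {name: x})  # warm-up so CUDA initialization does not skew the timing

n_runs = 10
t0 = time.perf_counter()
for _ in range(n_runs):
    sess.run(None, {name: x})
print(f"avg wall time: {(time.perf_counter() - t0) / n_runs * 1e3:.1f} ms/batch")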
 
Memory consumption increases with larger batch size: Input = BS x sizeof(float16) x (e.g. 3x9x9 + 3) x #streams + ONNX overhead. For a batch size of 262144 and 3 streams: Input ~387 MB + ONNX overhead
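Quick sanity check of the formula (a Python sketch; 2 bytes per float16 and 246 = 3x9x9 + 3 values per cluster follow from the note above):

def input_buffer_bytes(batch_size, n_streams, n_values=3 * 9 * 9 + 3, bytes_per_value=2):
    # Input = BS x sizeof(float16) x values x #streams, ONNX overhead excluded
    return batch_size * bytes_per_value * n_values * n_streams

print(f"{input_buffer_bytes(262144, 3) / 1e6:.0f} MB")  # -> 387 MB + ONNX overhead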
 
Physics
 

 

Raw efficiency without loopers: how many reconstructed clusters have been attached to at least one ideal (MC) cluster, removing all (local) looper-tagged regions

Another possibility: #(attached ideal clusters, not in looper-tagged regions) / #(reconstructed clusters)
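A minimal numpy sketch of both definitions; the per-cluster boolean flags and the dummy data are illustrative assumptions, not the actual output of the MC matching:

import numpy as np

rng = np.random.default_rng(0)
n_reco, n_ideal = 1000, 1100
reco_attached = rng.random(n_reco) < 0.9     # reco cluster attached to >=1 ideal (MC) cluster
reco_in_looper = rng.random(n_reco) < 0.1    # reco cluster lies in a looper-tagged region
ideal_attached = rng.random(n_ideal) < 0.85  # ideal cluster attached to >=1 reco cluster
ideal_in_looper = rng.random(n_ideal) < 0.1  # ideal cluster in a looper-tagged region

# Raw efficiency without loopers: looper-tagged regions removed from
# numerator and denominator
mask = ~reco_in_looper
eff_raw = (reco_attached & mask).sum() / mask.sum()

# Alternative: attached ideal clusters outside looper regions, normalized
# to the total number of reconstructed clusters
eff_alt = (ideal_attached & ~ideal_in_looper).sum() / n_reco

print(f"raw eff (no loopers): {eff_raw:.3f}, alternative: {eff_alt:.3f}")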

 

To-Do