Workernode Calibration
**********************

Each type of workernode in a cluster must be calibrated to produce a factor that represents the computing power of that workernode, expressed in HEPSPEC06. APEL Accounting uses this power factor to determine the amount of work applied to a job, by multiplying the time the job took to run by the factor.

This is simplest on a machine with a single CPU and a single core. Modern machines, however, contain multiple CPUs, each with multiple cores, and each core can run one job, or two when hyperthreading is enabled (we assume at least one hyperthread per job; fewer than that would cause CPU sharing, i.e. time slicing). The number of jobs running on a machine can therefore be anywhere between 1 and CPUs * cores * 2, and the power a workernode can apply to a job varies depending on how many other jobs the node is actually running.

Experiments show that running the maximum number of jobs does not necessarily maximise throughput. To fully utilise a system it may be necessary to choose a number of slots that is higher than the number of cores but lower than twice the number of cores (i.e. fully hyperthreaded). The reason for this is contention for shared resources.

The following procedure is adopted to deal with these complications and to maximise the overall applied power (i.e. throughput) of the node while also respecting memory constraints:

1. For each type of node, run the benchmark with every number of instances between cores and cores*2, covering the whole range from ignoring hyper-threads to using all hyper-threads.
2. Compare the results and select the number of benchmark instances that gave the maximum overall applied computing power; use that as the number of slots (i.e. logical CPUs) for this node type. This is the sweet spot.
3. Where the sweet spot is flat (e.g. the same overall throughput is obtained with 12, 13 or 14 instances), choose the lowest number, because this combines the highest throughput with the most memory available per job.
4. In any case, always choose a number that provides adequate memory per job.

This approach maximises use of CPU power in a fully loaded cluster while giving each job adequate memory. A sketch of the selection step is given below.
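The selection step can be expressed as a small script. The following is a minimal sketch in Python, assuming the benchmark totals for each instance count have already been collected; the function name choose_slot_count, the memory figures and the flatness tolerance are illustrative assumptions rather than part of any existing tool::

    # Minimal sketch of the slot-count selection described above.
    # All names and figures are illustrative, not site tooling.

    def choose_slot_count(scores, total_memory_gb, min_memory_per_job_gb,
                          tolerance=0.01):
        """Pick the number of slots (logical CPUs) for a node type.

        scores: dict mapping benchmark instance count -> total HEPSPEC06
                measured with that many concurrent instances.
        """
        # Keep only counts that leave adequate memory for each job.
        viable = {n: s for n, s in scores.items()
                  if total_memory_gb / n >= min_memory_per_job_gb}
        if not viable:
            raise ValueError("no instance count gives adequate memory per job")

        best = max(viable.values())
        # Treat results within `tolerance` of the maximum as a flat sweet
        # spot and take the lowest such count: same throughput, more
        # memory available per job.
        return min(n for n, s in viable.items()
                   if s >= best * (1 - tolerance))


    # Example: a 16-core node benchmarked at several instance counts
    # between 16 and 32 (figures invented for illustration).
    scores = {16: 310.0, 20: 335.0, 24: 348.0, 26: 349.0, 28: 349.5, 32: 342.0}
    slots = choose_slot_count(scores, total_memory_gb=64, min_memory_per_job_gb=2)
    print(slots)  # -> 24: throughput is flat from 24 upwards, so the lowest wins

The tolerance only makes "flat" explicit; a site could equally compare the raw figures by eye and apply the same rule of taking the lowest count within the flat region.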