• The cluster is running fine. There was an incident overnight into Monday when the cluster drained: it turned out the K8s certificates had expired. After renewal, things came back to normal and jobs are running fine.
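    To catch the next expiry before it drains the cluster, a lightweight check of the API server certificate could be scripted. Below is a minimal sketch; the endpoint and alert threshold are placeholders, and it only covers the apiserver serving certificate, not the kubelet or etcd certs:

    ```python
    import datetime
    import ssl

    from cryptography import x509  # third-party: pip install cryptography

    API_HOST = "k8s-api.example.org"  # placeholder: cluster API server endpoint
    API_PORT = 6443

    # Fetch the serving certificate without verification (it is signed by the
    # cluster-internal CA, so the system trust store will not validate it).
    pem = ssl.get_server_certificate((API_HOST, API_PORT))
    cert = x509.load_pem_x509_certificate(pem.encode())

    # Naive UTC datetimes on both sides of the subtraction.
    days_left = (cert.not_valid_after - datetime.datetime.utcnow()).days
    print(f"API server certificate expires in {days_left} days")
    if days_left < 30:  # placeholder threshold
        print("WARNING: renew soon, e.g. 'kubeadm certs check-expiration' "
              "followed by 'kubeadm certs renew all'")
    ```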
  • Looking into possible configuration tuning of the kube-scheduler. One candidate was inter-pod affinity for SCORE and MCORE jobs, which might help in some scenarios (a sketch is shown below). One concern is I/O performance when nodes are packed with only SCORE jobs. We then noticed a warning in the K8s documentation that inter-pod affinity requires a substantial amount of processing and can slow down scheduling significantly in large clusters; it is not recommended for clusters larger than several hundred nodes.
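    For reference, a minimal sketch of what such a co-location rule could look like, e.g. via the official kubernetes Python client; the job-type label is hypothetical and matching labels would have to be set on the pods:

    ```python
    from kubernetes import client

    # Hypothetical label distinguishing single-core from multi-core payload
    # pods; our pods do not currently carry it.
    SCORE_LABEL = {"panda-job-type": "SCORE"}

    # Soft (preferred) affinity: the scheduler tries to pack SCORE pods onto
    # nodes already running SCORE pods, but can still place them elsewhere.
    affinity = client.V1Affinity(
        pod_affinity=client.V1PodAffinity(
            preferred_during_scheduling_ignored_during_execution=[
                client.V1WeightedPodAffinityTerm(
                    weight=100,
                    pod_affinity_term=client.V1PodAffinityTerm(
                        label_selector=client.V1LabelSelector(
                            match_labels=SCORE_LABEL
                        ),
                        topology_key="kubernetes.io/hostname",
                    ),
                )
            ]
        )
    )

    # The affinity would be attached to the pod spec at submission time:
    # pod_spec = client.V1PodSpec(containers=[...], affinity=affinity)
    ```

    Even this soft (preferred) form pays the scheduling cost the documentation warns about, which is the main argument against enabling it here.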
  • Trying to optimize the job CPU request coefficient sent from Harvester (the default scale-down value is 0.9). The idea is not to overcommit the node CPU. For now the value in CRIC has been changed to 0.94, and so far things look fine.
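    To illustrate the effect, assuming the request is simply the core count times the coefficient converted to millicores (Harvester's exact rounding may differ), the sketch below compares how many single-core SCORE pods fit on a hypothetical 48-core node:

    ```python
    def cpu_request_millicores(job_cores: int, coeff: float) -> int:
        """CPU request attached to a job pod, in millicores (assumed formula)."""
        return round(job_cores * coeff * 1000)

    NODE_MILLICORES = 48 * 1000  # hypothetical 48-core worker node

    for coeff in (0.90, 0.94):
        per_pod = cpu_request_millicores(1, coeff)  # single-core SCORE job
        fits = NODE_MILLICORES // per_pod
        print(f"coeff={coeff:.2f}: request={per_pod}m -> {fits} pods per node")

    # coeff=0.90: request=900m -> 53 pods per node (~5 cores overcommitted)
    # coeff=0.94: request=940m -> 51 pods per node (~3 cores overcommitted)
    ```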
  • The next big step is to merge the SWT2_CPB_K8S cluster with SWT2_CPB (see Patrick's report).