Studying the performance of the new cluster and looking into the details whenever something doesn't look right. Overall, production is running fine.
Noticed a couple of drops in the production level, but they do not appear to be specific to the K8S cluster and look like they were due to the storage servers getting overloaded.
With the new hardware, noticed that the nodes with more CPU cores (64/72/96) are overcommitting the node CPU. On the previous cluster I solved this issue by optimizing the coefficient applied to the job CPU requests sent from Harvester. Have to look into this and probably readjust the coefficient.
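To reason about the readjustment, a minimal sketch of how a request coefficient changes node packing follows. The function names, the assumption that the request is simply job cores times the coefficient, and the example numbers are all hypothetical; this is not the actual Harvester configuration or logic.

```python
def jobs_that_fit(allocatable_cores: int, job_cores: int, coefficient: float) -> int:
    """Number of jobs the scheduler could place on a node when each job
    requests job_cores * coefficient CPUs (assumed request formula)."""
    # Work in millicores so the division is done on integers.
    request_m = round(job_cores * coefficient * 1000)
    if request_m <= 0:
        raise ValueError("CPU request must be positive")
    return (allocatable_cores * 1000) // request_m


def cpu_demand_ratio(allocatable_cores: int, job_cores: int, coefficient: float) -> float:
    """Ratio of cores the placed jobs could actually use to the cores the
    node has. A value above 1.0 means the node CPU is overcommitted."""
    jobs = jobs_that_fit(allocatable_cores, job_cores, coefficient)
    return jobs * job_cores / allocatable_cores
```

For example, under these assumptions `cpu_demand_ratio(96, 8, 0.5)` describes a 96-core node filled with 8-core jobs that each request only half of their cores. Raising the coefficient toward 1.0 brings the ratio back toward 1.0.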
Noticed that K8S was trying to schedule production jobs on the master node. A NoSchedule taint was in place initially, but it looks like it was lost at some point. It has been reinstated.
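Since the taint disappeared silently once, a periodic check could catch a repeat. The sketch below assumes the node object has the shape returned by `kubectl get node <name> -o json` (taints under `spec.taints`) and that the master uses one of the standard control-plane taint keys. Which key this cluster actually uses is an assumption, and the function name is hypothetical.

```python
# Standard taint keys for control-plane nodes; which one applies here is assumed.
CONTROL_PLANE_TAINT_KEYS = {
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",  # key used by older Kubernetes releases
}


def has_noschedule_taint(node: dict) -> bool:
    """Return True if the node carries a control-plane NoSchedule taint."""
    # spec.taints is absent (or null) when the node has no taints at all.
    taints = node.get("spec", {}).get("taints") or []
    return any(
        taint.get("key") in CONTROL_PLANE_TAINT_KEYS
        and taint.get("effect") == "NoSchedule"
        for taint in taints
    )
```

The function could be fed the parsed JSON for the master node from a cron job or a monitoring probe, raising an alert when it returns False.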
Working on reinstalling Prometheus on a dedicated node. The next step is setting up job accounting reporting.