- Studying the performance of the new cluster and investigating anything that does not look right. Overall, production is running fine.
- Noticed a couple of drops in the production level, but they do not appear to be specific to the K8S cluster and look to have been caused by storage servers getting overloaded.
- With the new hardware, noticed that the nodes with more CPU cores (64/72/96) are overcommitting the node CPU. On the previous cluster I solved this by tuning the job CPU request coefficient sent from Harvester. Have to look into this and probably readjust it for the new nodes.
- Noticed that K8S was trying to schedule production jobs on the master node. A NoSchedule taint was in place initially but appears to have been lost at some point; it has been reinstated.
- Working on reinstalling Prometheus on a dedicated node. The next step is setting up job accounting reporting.
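
To illustrate the CPU overcommit point above, here is a rough sketch of how a request coefficient below 1 can lead the scheduler to pack more jobs onto a node than it has cores for. It assumes the coefficient simply scales a job's nominal core count into its Kubernetes CPU request and that each job then uses all of its nominal cores; the function names, the 8-core job size, and the 0.8 coefficient are hypothetical and not taken from the actual Harvester configuration.

```python
def jobs_that_fit(allocatable_mcpu: int, job_cores: int, coefficient: float) -> int:
    """Number of jobs the scheduler can place on a node, judging by CPU requests only.

    Assumption: the CPU request is job_cores * coefficient, expressed in millicores.
    """
    request_mcpu = round(job_cores * 1000 * coefficient)
    return allocatable_mcpu // request_mcpu


def overcommit_ratio(allocatable_mcpu: int, job_cores: int, coefficient: float) -> float:
    """Ratio of worst-case CPU demand to allocatable CPU (above 1.0 means overcommitted).

    Assumption: every placed job actually uses all of its nominal cores.
    """
    jobs = jobs_that_fit(allocatable_mcpu, job_cores, coefficient)
    return jobs * job_cores * 1000 / allocatable_mcpu


if __name__ == "__main__":
    # Illustrative numbers only: nodes with all cores allocatable, 8-core jobs.
    for node_cores in (64, 72, 96):
        allocatable = node_cores * 1000
        for coefficient in (0.8, 1.0):
            ratio = overcommit_ratio(allocatable, 8, coefficient)
            print(f"{node_cores} cores, coefficient {coefficient}: ratio {ratio:.2f}")
```

Under these assumptions, a coefficient that was tuned for smaller nodes can admit several extra jobs on a large node, so the value probably has to be re-derived per node size rather than reused as-is.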