The SWT2_CPB_K8S_TEST cluster was rebuilt, and new nodes were added to it. We are now above 1K running job slots (we were slightly below 1K on the pre-scrubbing plots).
Meanwhile the SWT2_CPB_K8S cluster was drained.
We switched the configurations in CRIC and Harvester so that the new cluster continues to run under the SWT2_CPB_K8S queue, and the SWT2_CPB_K8S_TEST queue was disabled.
The new SWT2_CPB_K8S cluster is running fine.
While monitoring it, I noticed an artificial peak in the number of running job slots. The same bump was present at all grid sites as well, so I opened a SNOW ticket: https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF2335613 . It turned out that a restart of the collecting agent resulted in duplicated records ...
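To illustrate why duplicated records show up as an artificial peak, here is a minimal sketch. It assumes each record is a (timestamp, site, running_slots) tuple; the record layout, the site names, and the function names are all hypothetical and are not taken from the actual monitoring pipeline.

```python
# Hypothetical record layout: (timestamp, site, running_slots).
# This is an illustration only, not the real collector's data format.

def total_slots(records):
    """Sum running slots per timestamp across all sites."""
    totals = {}
    for ts, site, slots in records:
        totals[ts] = totals.get(ts, 0) + slots
    return totals


def deduplicate(records):
    """Keep only the first record seen for each (timestamp, site) pair."""
    seen = set()
    unique = []
    for ts, site, slots in records:
        key = (ts, site)
        if key not in seen:
            seen.add(key)
            unique.append((ts, site, slots))
    return unique


# Made-up sample data: the 12:05 records are reported twice,
# as could happen if a collector re-sent them after a restart.
records = [
    ("12:00", "SITE_A", 1000),
    ("12:00", "SITE_B", 500),
    ("12:05", "SITE_A", 1000),
    ("12:05", "SITE_B", 500),
    ("12:05", "SITE_A", 1000),  # duplicate
    ("12:05", "SITE_B", 500),   # duplicate
]

raw_totals = total_slots(records)                  # 12:05 is double-counted
clean_totals = total_slots(deduplicate(records))   # 12:05 counted once
```

Summing the raw records doubles the 12:05 total for every site at once, which matches a bump that appears across all sites rather than at a single one.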
Got a node from Patrick that was not used in production; it will serve as a dedicated node on which to reinstall Prometheus.
Also looking into job accounting reporting and the available options ...