- The SWT2_CPB_K8S_TEST cluster was rebuilt, and new nodes were added to it. We are now above 1K running job slots (we were slightly below 1K on the pre-scrubbing plots).
- Meanwhile, the SWT2_CPB_K8S cluster was drained.
- We switched the configurations in CRIC and Harvester so that the new cluster continues to run under the SWT2_CPB_K8S queue, and the SWT2_CPB_K8S_TEST queue was disabled.
- The new SWT2_CPB_K8S cluster is running fine.
- While monitoring it, I noticed an artificial peak in the number of running job slots. The same bump was present at all grid sites as well, so I opened a SNOW ticket: https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF2335613 . It appeared that a restart of the collecting agent resulted in duplicated records ...
- Got a node from Patrick that was not used in production, to reinstall Prometheus on it as a dedicated node.
- Also looking into the available options for job accounting reporting ...