• The SWT2_CPB_K8S_TEST cluster was rebuilt, and new nodes were added to it. We are now above 1K running job slots (we were slightly below 1K on the pre-scrubbing plots).
  • Meanwhile, the SWT2_CPB_K8S cluster was drained.
  • We switched the configurations in CRIC and Harvester so that the new cluster continues to run under the SWT2_CPB_K8S queue, and the SWT2_CPB_K8S_TEST queue was disabled.
  • The new SWT2_CPB_K8S cluster is running fine.
  • While monitoring it, I noticed an artificial peak in the number of running job slots. The same bump was present at all other grid sites as well, so I opened a SNOW ticket: https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF2335613. It appeared that a restart of the collecting agent resulted in duplicated records ...
  • Got a node from Patrick that was not used in production, to reinstall Prometheus on it as a dedicated node.
  • Also looking into the available options for job accounting reporting ...