SWT2_CPB:
Operations
We experienced some errors with long-running jobs hitting the max walltime limit, but these have essentially stopped.
We experienced a dip in running jobs on 4/24. We investigated but made no changes, and the dip appeared to resolve itself on 4/25. Reviewing logs, we saw no clear local issues.
Other than this, we have been running well and staying full the past two weeks.
The job mix has returned to majority production jobs after the transition from SWT2_CPB_TEST to SWT2_CPB; for a brief period we ran mostly analysis jobs with very few production jobs.
Slurm removed several worker nodes from service for various reasons, possibly related to the job mix. We investigated the nodes, rebuilt them, and returned them to service; so far we have seen no further issues with them.
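When troubleshooting node removals like these, the drain reasons Slurm records can be summarized programmatically. A minimal sketch, assuming the default `sinfo -R` column layout (REASON, USER, TIMESTAMP, NODELIST); the node names and reasons below are made up for illustration:

```python
"""Sketch: summarize drained worker nodes from `sinfo -R` output."""

def parse_sinfo_reasons(output: str) -> list[dict]:
    """Parse `sinfo -R` text into {reason, user, timestamp, nodelist} records."""
    records = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # REASON may contain spaces; the last three fields do not,
        # so split from the right into at most four pieces.
        reason, user, timestamp, nodelist = line.rsplit(None, 3)
        records.append({
            "reason": reason.strip(),
            "user": user,
            "timestamp": timestamp,
            "nodelist": nodelist,
        })
    return records

# Illustrative sample mirroring the default sinfo -R layout:
sample = """REASON               USER      TIMESTAMP           NODELIST
Kill task failed     root      2024-04-24T09:12:33 c[101-103]
Node unexpectedly re root      2024-04-25T01:02:03 c110
"""

for rec in parse_sinfo_reasons(sample):
    print(rec["nodelist"], "->", rec["reason"])
```

Keeping a daily snapshot of these records makes it easier to spot whether the same nodes or the same drain reasons recur after a rebuild.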
Noticed low transfer efficiency with SWT2_CPB as the destination site for transfers from Spain (ES) on 4/28. Requested assistance/info from DPA.
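For context, the transfer efficiency shown in the monitoring is simply successful transfers over total attempts. A minimal sketch with made-up counts (not the actual ES-to-SWT2_CPB numbers):

```python
def transfer_efficiency(successes: int, failures: int) -> float:
    """Transfer efficiency = successful / (successful + failed) attempts."""
    total = successes + failures
    return successes / total if total else 0.0

# Illustrative counts only:
print(f"{transfer_efficiency(120, 380):.0%}")  # prints 24%
```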
Renewed a certificate for an XRootD proxy server that was about to expire. We will deploy the renewed certificate soon.
EL9 Migration
The test cluster is complete. We are waiting for assistance to add our new test CE and XRootD proxy server to CRIC properly, so that we can verify that test jobs complete successfully with our new EL9 modules.
Campus networking assigned the hostname and IP address in DNS.
Received new certificate for test XRootD Proxy and CE.
All required services and appliances are running.
We want to test these modules and improve them further before implementing them in the production cluster.
Storage module is complete. It requires testing.
The XRootD Proxy module is complete but still being improved. It requires testing.
New Storage
We physically installed all new storage servers.
We found a way to avoid purchasing new rails or racks to address the issue with the long storage rails: we slightly modified some of our racks to make enough space to install the new storage without problems.
Configured RAID and iDRAC on all new storage.
Once tests in the newly finished test cluster show all is fine with storage modules, we will begin next steps in deploying new storage.
Monitoring
We are continually developing new internal monitoring to better troubleshoot our CE and Slurm.
We are currently testing out different tools and having internal discussions on this.
GGUS Ticket - Enable Network Monitoring
Recently followed up again with campus networking concerning this ticket, including a reminder of our request and references to the details of our last meeting. We are awaiting a response.
GGUS Ticket - GoeGrid Transfer Failures
ESnet's network experts have begun working on the connectivity problem between SWT2 and GOEGRID. We have repeated the tests from SWT2's side and asked GOEGRID to perform the same tests, first via email and then via ticket.
OU: