SWT2_CPB:
Operations
We experienced some errors with long-running jobs hitting the max walltime limit, but these have essentially stopped.
We experienced a dip in running jobs on 4/24. We investigated but made no changes, and the dip appeared to resolve itself on 4/25. Reviewing logs, we saw no clear local issues.
Other than this, we have been running well and staying full the past two weeks.
The job mix has returned to majority production jobs after the transition from SWT2_CPB_TEST to SWT2_CPB; for a brief period we ran mostly analysis jobs with very few production jobs.
Slurm removed several worker nodes from service for various reasons, possibly related to the job mix. We investigated the nodes, rebuilt them, and returned them to service; so far we have seen no further issues with them.
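When troubleshooting node removals like these, the drain reasons Slurm records can be summarized programmatically. A minimal sketch, assuming the default `sinfo -R` column layout (REASON, USER, TIMESTAMP, NODELIST); the node names and reasons below are made up for illustration:

```python
"""Sketch: summarize drained worker nodes from `sinfo -R` output."""

def parse_sinfo_reasons(output: str) -> list[dict]:
    """Parse `sinfo -R` text into {reason, user, timestamp, nodelist} records."""
    records = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # REASON may contain spaces; the last three fields do not,
        # so split from the right into at most four pieces.
        reason, user, timestamp, nodelist = line.rsplit(None, 3)
        records.append({
            "reason": reason.strip(),
            "user": user,
            "timestamp": timestamp,
            "nodelist": nodelist,
        })
    return records

# Illustrative sample mirroring the default sinfo -R layout:
sample = """REASON               USER      TIMESTAMP           NODELIST
Kill task failed     root      2024-04-24T09:12:33 c[101-103]
Node unexpectedly re root      2024-04-25T01:02:03 c110
"""

for rec in parse_sinfo_reasons(sample):
    print(rec["nodelist"], "->", rec["reason"])
```

Keeping a daily snapshot of these records makes it easier to spot whether the same nodes or the same drain reasons recur after a rebuild.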
Noticed low transfer efficiency with SWT2_CPB as the destination site for transfers from Spain (ES) on 4/28. Requested assistance/info from DPA.
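For context, the transfer efficiency shown in the monitoring is simply successful transfers over total attempts. A minimal sketch with made-up counts (not the actual ES-to-SWT2_CPB numbers):

```python
def transfer_efficiency(successes: int, failures: int) -> float:
    """Transfer efficiency = successful / (successful + failed) attempts."""
    total = successes + failures
    return successes / total if total else 0.0

# Illustrative counts only:
print(f"{transfer_efficiency(120, 380):.0%}")  # prints 24%
```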
Renewed a certificate for an XRootD proxy server that was about to expire. We will deploy the renewed certificate soon.
EL9 Migration
The test cluster is complete. We are waiting for assistance to add our new test CE and XRootD proxy server to CRIC properly, so that we can verify that test jobs complete successfully with our new EL9 modules.
Campus networking assigned the hostname and IP address in DNS.
Received new certificate for test XRootD Proxy and CE.
All required services and appliances are running.
We want to test these modules and improve them further before implementing them in the production cluster.
Storage module is complete. It requires testing.
The XRootD Proxy module is complete but still being improved. It requires testing.
New Storage
We physically installed all new storage servers.
We found a way to avoid purchasing new rails or racks to address the issue with the long storage rails: we slightly modified some of our racks to make enough space to install the new storage without problems.
Configured RAID and iDRAC on all new storage.
Once tests in the newly finished test cluster show all is fine with storage modules, we will begin next steps in deploying new storage.
Monitoring
We are continually developing new internal monitoring to better troubleshoot our CE and Slurm.
We are currently testing out different tools and having internal discussions on this.
GGUS Ticket - Enable Network Monitoring
Recently followed up again with campus networking concerning this ticket, including a reminder of our request and references to the details of our last meeting. We are awaiting a response.
GGUS Ticket - GoeGrid Transfer Failures
ESnet's network experts have begun working on the connectivity problem between SWT2 and GOEGRID. We have repeated the tests from SWT2's side and asked GOEGRID to perform the same tests, first via email and then via ticket.
OU: