SWT2_CPB:
DNS Issue (External Change) - Drain
Campus networking performed work early Sunday morning (2/22/25) that introduced a routing problem, which blocked inbound packets to the data center. This indirectly impacted DNS, which led to various issues and to the site draining.
We noticed this Sunday morning, investigated, discovered issues with DNS, then implemented a temporary fix that afternoon in order to receive jobs again.
Contacted campus networking on Monday and held a meeting to troubleshoot together; they resolved the routing issue on the campus router.
EL9 Performance
Other than the DNS issue, we have been running very well with the new EL9 nodes. We have been running between 16K and 18K cores with a very low error rate for production jobs.
EL9 Next Steps
Continuing to develop the EL9 test cluster so that we are in a better position to develop and test the rest of the EL9 appliances. Currently, it is a hybrid of EL7 and EL9, similar to the production cluster.
New Storage Deployment
The rails sent by Dell are too long for our racks. We installed one storage node in order to test them, and are purchasing third-party rails to see whether they work better for us.
We have 8 MD3460 RAID arrays to replace, and 12 new storage nodes.
The plan is still under discussion, but we intend to deploy four new storage nodes as EL7, place two in the test cluster to test the migration from EL7 to EL9, and use the remaining four to gradually replace the old MD3460s.
Plan to have the four storage nodes deployed as EL7 within the next month, but the rest of the deployment will be more gradual.
Procurement
Planning a potential purchase of new hardware to replace head nodes and to improve network infrastructure (switches).
OU: