Hardware:
Three R740xd2 servers from MSU are in the production system. The IO benchmark shows that a strip size of 512K for RAID6 gives the best IO performance, by about 10%.
Incident:
On 11/22/2021, starting at 10am local time, the 10G commodity link connecting the AGLT2 UM site to Merit went down, so all nodes in the aglt2.org domain lost access to the Merit DNS servers. The issue was resolved around 7pm, when Merit repaired the hardware connecting to this link. During this window, all data transfers failed. The site had already been drained to 8% before the network outage because of a planned Condor update.
dCache pool umfs06_12 caused jobs to fail when staging in files; restarting the dCache service resolved the problem.
System update:
Condor was updated from 8.8.15 to 9.0.6, and condor-ce was updated from 4.5.2 to 5.1.2. During this update, we switched authentication for the Condor cluster from host-based to token-based, which went smoothly because we had already practiced it on a testbed. However, we hit an issue with condor-ce after the update: condor-ce could see the incoming jobs, but the jobs could not be submitted to the local Condor system. It took a few hours of debugging to find the cause: the new htcondor-ce does not support the comment format used in the job router configuration. This has been reported as a bug to the HTCondor development team. At about 13:00 on 11/23/2021, the site started to ramp up with jobs. Throughout the entire period of draining and update problems, BOINC jobs were able to fill all the job slots of the site.