05/01/2022
There was a scheduled power shutdown in the UM Tier3 server room for facility maintenance; the shutdown lasted 6 hours. A couple of things broke during the shutdown, including the network card of one UPS unit and the containerd/network-forwarding service on one of the nodes of the SLATE Kubernetes cluster. (The containerd failure was caused by a wrong configuration of net.ipv4.conf.default.forwarding and net.ipv4.conf.all.forwarding; both should be set to 1.) The node problem took down one of the squid servers hosted on the cluster, so all traffic went to the other squid server, and no jobs failed.
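For reference, a minimal sketch of the forwarding fix (the sysctl keys are the ones named above; the drop-in file name is illustrative):

    # Enable IPv4 forwarding immediately (runtime)
    sysctl -w net.ipv4.conf.all.forwarding=1
    sysctl -w net.ipv4.conf.default.forwarding=1

    # Persist the setting across reboots
    printf 'net.ipv4.conf.all.forwarding = 1\nnet.ipv4.conf.default.forwarding = 1\n' \
        > /etc/sysctl.d/90-forwarding.conf
    sysctl --system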
05/02/2022
On the SLATE Kubernetes node sl-um-es5, cfengine reverted the IP-forwarding change, so the squid service went down again. This caused many BOINC job failures, since all BOINC clients are configured to use this proxy server. We switched the BOINC proxy server to sl-um-es3, which is located in the Tier2 server room and should be more robust. The BOINC jobs started to refill the worker nodes after we changed the proxy, and we later fixed the sl-um-es5 node.
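For reference, a quick way to confirm that a squid instance is answering before repointing clients at it (the domain is illustrative, and 3128 is squid's default port, which may differ at the site):

    # Fetch a header through the proxy; an HTTP status line means squid is serving
    curl -sI -x http://sl-um-es3.<domain>:3128 http://atlas.cern | head -1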
05/05/2022
During our annual renewal of the host certificates, we mistakenly requested the gatekeepers’ host certificates from InCommon RSA instead of InCommon IGTF, which started to cause authentication errors on all gatekeepers for any incoming jobs. The change was made late in the afternoon, and the error was not caught until the next morning, so the site drained overnight. We replaced the RSA certificates with IGTF certificates on the gatekeepers, and the site started to ramp back up. During the 17-hour draining period, BOINC jobs ramped up as designed and filled the whole cluster, so the overall CPU time used by ATLAS jobs stayed about the same as before the draining.
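A quick way to catch this class of mistake is to print the issuer of the installed certificate and confirm it names the IGTF CA rather than the RSA one (the path below is the standard grid layout; the exact issuer string depends on the CA):

    # Print the issuer and expiry of the gatekeeper host certificate
    openssl x509 -in /etc/grid-security/hostcert.pem -noout -issuer -enddate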