Unexpected downtime over the weekend due to a power failure at the data center. The cluster initially failed to recover correctly because of two problematic servers, and did not refill until this was fixed.
Ongoing problem: the pilot sees all the disk space on a node, not just the disk that is available for jobs. On C6320 servers this can lead to the storage space being overcommitted, causing all jobs on the server to fail. A ticket was opened about a couple of tasks, but the issue is not specific to those tasks; it is a more general one. We expect to upgrade the cluster to a new version of OKD fairly soon, which should resolve the underlying issue.
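As an illustration of the overcommit mechanism described above, here is a minimal Python sketch. All numbers and function names are hypothetical and chosen only for the arithmetic; they are not taken from the cluster configuration or from the pilot code.

```python
def jobs_admitted(advertised_gb: int, per_job_gb: int) -> int:
    """Jobs that would be accepted if advertised_gb is believed to be free."""
    if per_job_gb <= 0:
        raise ValueError("per_job_gb must be positive")
    return advertised_gb // per_job_gb


def overcommit_gb(total_gb: int, job_quota_gb: int, per_job_gb: int) -> int:
    """Disk requested beyond the job quota when the whole node disk is advertised."""
    requested = jobs_admitted(total_gb, per_job_gb) * per_job_gb
    return max(0, requested - job_quota_gb)


# Hypothetical node: 1000 GB of disk in total, of which 600 GB is for jobs.
total_gb = 1000
job_quota_gb = 600
per_job_gb = 50  # hypothetical disk request per job

print(jobs_admitted(total_gb, per_job_gb))      # pilot's view of the node: 20
print(jobs_admitted(job_quota_gb, per_job_gb))  # what actually fits: 12
print(overcommit_gb(total_gb, job_quota_gb, per_job_gb))  # 400
```

In this sketch the pilot would accept 20 jobs where only 12 fit, so the job area can fill up and every job on the node is affected, not only the most recent ones.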
BGP tagging of LHCone prefixes should be in place now.