Weather-related power dip on May 2nd; approximately ⅓ of the Tier 1 (by core count) was affected
Lost power to one row of compute for a few minutes at ~02:30 (Eastern Time)
Notification was received and onsite work was done to recover the lost portion of the condor pool
~99% recovery completed by 05:00, 100% recovery by 10:00
Initial testing has begun on a revised condor memory (cgroups) configuration, which should better protect worker nodes (EPs) from completely exhausting their memory
These changes don't affect the Tier 1 yet, but they are on the horizon; they are currently being rolled out on one of our other pools
Also Storage: (I don't have permission to add there)
Infrastructure
The local OSG CAs Puppet class has been improved, enhancing CRL updates and repository management.
Monitoring
Integration of various dCache components into the ELK infrastructure is underway.
Pools are currently being integrated to complete the deployment of Filebeat.