Unplanned power interruption (morning of 3/7)
Most things recovered within a few hours; a few VMs took several hours to recover
Casualties: the NIC on one OpenShift worker (a few days to replace) and a few corrupted VM disk images (one needs to be rebuilt; another was copied from RHEV again, since it was recently migrated and the old image was still present)
OpenShift: more than half of the migrations from RHEV are complete; close to supporting containers (ready for testing very soon)
BNL_ARM was not getting new jobs due to missing SW tags in CRIC. Solved.
Unplanned power outage on Friday, 7 Mar.
This led to a large number of job failures as workers lost power
HTCondor recovery by ~15:45 (Eastern time)
Job ramp-up was gradual but successful
Some worker nodes came up in a bad state and were rebuilt. Full capacity restored
There was an effort to recover additional previously downed worker nodes; capacity is slightly higher post-outage as a result of this effort (34.2k cores -> 35.4k cores)
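For reference, a pool-wide core tally like the 34.2k/35.4k figures above can be cross-checked with the HTCondor Python bindings. The sketch below is only illustrative, not the procedure actually used; the collector it talks to and the slot layout of the pool are assumptions.

    #!/usr/bin/env python3
    """Rough tally of cores reporting to an HTCondor pool.

    A minimal sketch using the HTCondor Python bindings; it assumes the
    collector configured on this host (COLLECTOR_HOST) and simply sums
    per-machine TotalCpus values.
    """
    import htcondor

    # Talk to the collector configured for this host.
    collector = htcondor.Collector()

    # One ad per slot; TotalCpus is a machine-level attribute repeated in
    # every slot ad, so de-duplicate by Machine name before summing.
    ads = collector.query(
        htcondor.AdTypes.Startd,
        projection=["Machine", "TotalCpus"],
    )

    cores_per_machine = {}
    for ad in ads:
        machine = ad.get("Machine")
        total = ad.get("TotalCpus")
        if machine and total is not None:
            cores_per_machine[machine] = int(total)

    print(f"worker nodes reporting: {len(cores_per_machine)}")
    print(f"total cores in pool:    {sum(cores_per_machine.values())}")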
Power glitch outage on 03/07/25.
The ATLAS production storage service was degraded
The Chimera server was down for 7 minutes but restarted without any issues or corruption.
Other dCache core services failed over to redundant components.
Some pool hosts restarted automatically, while a few others required manual hardware intervention. No data loss was observed.
A subset of doors were also affected and recovered without issue
The impact was limited to some READ operations and READ/WRITE transfers that were in progress during the power glitch.
The system was fully functional by 11 AM (EST).
The Test/Integration instance was affected by the OpenShift issue noted above
Work on DMZ Pools: the underlying filesystem block size of the DMZ pools has been aligned with the block size of the NVMe devices, resulting in improved READ IOPS.
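As a side note on how such alignment can be verified: the short Python sketch below compares the device block sizes reported by the kernel (via Linux sysfs) with the block size of the filesystem on top of it. The device name nvme0n1 and mount point /dcache/pool are hypothetical placeholders; this is illustrative only and does not reflect the exact tooling used on the DMZ pools.

    #!/usr/bin/env python3
    """Check that a pool filesystem's block size is aligned with its NVMe device.

    A minimal sketch, assuming Linux sysfs paths and a hypothetical
    device/mount pair; it only reports alignment, it changes nothing.
    """
    import os

    DEVICE = "nvme0n1"           # hypothetical NVMe device backing a pool
    MOUNTPOINT = "/dcache/pool"  # hypothetical mount point of that pool

    def sysfs_int(path: str) -> int:
        with open(path) as f:
            return int(f.read().strip())

    # Block sizes the kernel reports for the NVMe namespace.
    logical = sysfs_int(f"/sys/block/{DEVICE}/queue/logical_block_size")
    physical = sysfs_int(f"/sys/block/{DEVICE}/queue/physical_block_size")

    # Block size actually used by the filesystem mounted on top of it.
    fs_block = os.statvfs(MOUNTPOINT).f_bsize

    print(f"{DEVICE}: logical={logical}B physical={physical}B, fs block={fs_block}B")
    if fs_block % physical == 0:
        print("filesystem block size is a multiple of the device block size (aligned)")
    else:
        print("WARNING: filesystem block size is not aligned with the device")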
All operations-related news was already reported above.