Unplanned power interruption (morning of 3/7)
Most things recovered within a few hours; a few VMs took several hours to recover
Casualties: the NIC on one OpenShift worker (a few days to replace) and a few corrupted VM disk images (one needs to be rebuilt; another was copied from RHEV again, since it was recently migrated and the old image was still present)
OpenShift: more than half of the migrations from RHEV are complete; close to supporting containers (ready for testing very soon)
BNL_ARM was not getting new jobs due to missing SW tags in CRIC. Solved.
Unplanned power outage on Friday, 7 Mar.
This led to a large number of job failures as workers lost power
HTCondor recovery by ~15:45 (Eastern time)
Job ramp-up was gradual but successful
Some worker nodes came up in a bad state and were rebuilt. Full capacity restored
There was an effort to recover additional previously downed worker nodes; capacity is slightly higher post-outage as a result of this effort (34.2k cores -> 35.4k cores)
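For reference, a pool-wide core tally like the 34.2k/35.4k figures above can be cross-checked with the HTCondor Python bindings. The sketch below is only illustrative, not the procedure actually used; the collector it talks to and the slot layout of the pool are assumptions.

    #!/usr/bin/env python3
    """Rough tally of cores reporting to an HTCondor pool.

    A minimal sketch using the HTCondor Python bindings; it assumes the
    collector configured on this host (COLLECTOR_HOST) and simply sums
    per-machine TotalCpus values.
    """
    import htcondor

    # Talk to the collector configured for this host.
    collector = htcondor.Collector()

    # One ad per slot; TotalCpus is a machine-level attribute repeated in
    # every slot ad, so de-duplicate by Machine name before summing.
    ads = collector.query(
        htcondor.AdTypes.Startd,
        projection=["Machine", "TotalCpus"],
    )

    cores_per_machine = {}
    for ad in ads:
        machine = ad.get("Machine")
        total = ad.get("TotalCpus")
        if machine and total is not None:
            cores_per_machine[machine] = int(total)

    print(f"worker nodes reporting: {len(cores_per_machine)}")
    print(f"total cores in pool:    {sum(cores_per_machine.values())}")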
Power glitch outage on 03/07/25.
The ATLAS production storage service was degraded
The Chimera server was down for 7 minutes but restarted without any issues or corruption.
Other dCache core services failed over to redundant components.
Some pool hosts restarted automatically, while a few others required manual hardware intervention. No data loss was observed.
A subset of doors were also affected and recovered without issue
The impact was limited to some READ operations and READ/WRITE transfers that were in progress during the power glitch.
The system was fully functional by 11 AM (EST).
The Test/Integration instance was affected by the OpenShift issue noted above
Work on DMZ Pools: the underlying filesystem block size of the DMZ pools has been aligned with the block size of the NVMe devices, resulting in improved READ IOPS.
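As a side note on how such alignment can be verified: the short Python sketch below compares the device block sizes reported by the kernel (via Linux sysfs) with the block size of the filesystem on top of it. The device name nvme0n1 and mount point /dcache/pool are hypothetical placeholders; this is illustrative only and does not reflect the exact tooling used on the DMZ pools.

    #!/usr/bin/env python3
    """Check that a pool filesystem's block size is aligned with its NVMe device.

    A minimal sketch, assuming Linux sysfs paths and a hypothetical
    device/mount pair; it only reports alignment, it changes nothing.
    """
    import os

    DEVICE = "nvme0n1"           # hypothetical NVMe device backing a pool
    MOUNTPOINT = "/dcache/pool"  # hypothetical mount point of that pool

    def sysfs_int(path: str) -> int:
        with open(path) as f:
            return int(f.read().strip())

    # Block sizes the kernel reports for the NVMe namespace.
    logical = sysfs_int(f"/sys/block/{DEVICE}/queue/logical_block_size")
    physical = sysfs_int(f"/sys/block/{DEVICE}/queue/physical_block_size")

    # Block size actually used by the filesystem mounted on top of it.
    fs_block = os.statvfs(MOUNTPOINT).f_bsize

    print(f"{DEVICE}: logical={logical}B physical={physical}B, fs block={fs_block}B")
    if fs_block % physical == 0:
        print("filesystem block size is a multiple of the device block size (aligned)")
    else:
        print("WARNING: filesystem block size is not aligned with the device")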
All operations-related news was already reported above.