08/23
The UM network expert helped the UM site perform a system update on the Cisco border switches to address the issue we noticed during the 08/05 power outage (the management VLAN stopped working after the switches were power cycled). The update went smoothly.
08/24
One of the management switches in the UM Tier3 server room (sw11-m-01) lost connectivity, and a power cycle brought it back online. No services were impacted by this switch’s downtime.
8/28
The UM campus network went completely down due to security concerns. Our site's AFS servers are hosted in the grid.umich.edu domain, which is part of the UM campus network, so the UM login nodes were not accessible. In the same time window, 4 dCache pool nodes (msufs17/21/23, umfs40) had their pool services offline; the cause was unknown, and the fix was to restart the pools. The UM site also took the opportunity to migrate the 6 AFS servers from the grid.umich.edu domain to the aglt2.org domain to avoid future disruptions caused by the campus network. The migration involved a lot of detailed work, and it took us 4 days to fully recover everything in AFS, but it should give us a more robust AFS system that is better integrated with the AGLT2 network.
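As a small illustration of the kind of post-migration check this involves, the sketch below verifies that a set of AFS server names resolve under the new aglt2.org domain; the hostnames used here are placeholders, since the actual server names are not listed in this report.

    #!/usr/bin/env python3
    # Sketch: confirm migrated AFS servers resolve under aglt2.org.
    # The hostnames below are placeholders; the real server names are not
    # given in this report.
    import socket

    AFS_SERVERS = ["afs01", "afs02", "afs03", "afs04", "afs05", "afs06"]

    for name in AFS_SERVERS:
        fqdn = f"{name}.aglt2.org"
        try:
            addr = socket.gethostbyname(fqdn)
            print(f"{fqdn} -> {addr}")
        except socket.gaierror as err:
            print(f"{fqdn}: lookup failed ({err})")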
9/6 and 9/8
The UM site received the two replacement PDUs from CyberPower, and we replaced the two bad PDUs in Rack 4, which took the dCache pool node umfs30 offline for half an hour.
9/9
OSG released a bad CA certificate rpm (osg-ca-certs-1.114-1), and our certificate hosting node gate01 picked up the bad version through auto-update. OSG soon released a fixed version, osg-ca-certs-1.114-2, which gate01 also picked up, and we restarted all dCache head and door nodes as recommended by OSG. During the weekend, however, we still saw a 60% transfer failure rate. It turned out that some bad PEM files left over from the bad osg-ca-certs version contained duplicated certificate entries, and we had to reinstall osg-ca-certs-1.114-2 to get the correct PEM files.
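For reference, the sketch below shows one way to spot this kind of leftover problem by scanning a CA directory for PEM files that contain more than one certificate block or that duplicate a certificate from another file. It is only a minimal illustration: the directory path /etc/grid-security/certificates and the duplicate-detection approach are assumptions, not part of the actual fix.

    #!/usr/bin/env python3
    # Sketch: scan a CA directory for PEM files with duplicated certificate
    # blocks. Assumptions (not from the report): the CA bundle lives in
    # /etc/grid-security/certificates and duplicates can be found by hashing
    # each BEGIN/END CERTIFICATE block.
    import glob
    import hashlib
    import re

    CA_DIR = "/etc/grid-security/certificates"  # assumed OSG CA location
    CERT_RE = re.compile(
        r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
    )

    seen = {}  # sha256 of a certificate block -> first file it appeared in
    for path in sorted(glob.glob(f"{CA_DIR}/*.pem")):
        with open(path, "r", errors="replace") as fh:
            blocks = CERT_RE.findall(fh.read())
        # Each CA .pem file here is expected to hold a single certificate.
        if len(blocks) != 1:
            print(f"{path}: {len(blocks)} certificate blocks")
        for block in blocks:
            digest = hashlib.sha256(block.encode()).hexdigest()
            if digest in seen and seen[digest] != path:
                print(f"{path}: duplicates a certificate in {seen[digest]}")
            else:
                seen.setdefault(digest, path)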