1) MSU resolved the cooling issue in the server room

2) On 18 May, the core switch (Mellanox 2700) died from a storage device issue. To recover, we replaced it with the new Dell S5232F switch. Afterwards, however, we saw heavy packet loss among different hosts, and we had to connect all the other switches and core service hosts directly to the new switch. It seems to be a spanning tree issue between the older Dell switches (whose OS can't be updated to OS10) and the new switch. We have not been able to resolve this issue, particularly as we already plan to replace all switches in mid-June.

During this incident, we lost 3 dCache pools holding 150 TB of data on an aging, out-of-warranty MD3260 storage enclosure. (The vdisk failed and could not be recovered without technical support.) We declared the files lost.

One big challenge in the recovery came from VMware: one of the VMware nodes had trouble booting into its system (due to a network issue), so we had to migrate the images from this host to the other two. We are still working with VMware support to bring this host back into the cluster.

This incident caused 6 days of downtime. We still have ~30 worker nodes suffering from significant packet loss, but the ATLAS job failure rate is low (~3%). The impact is greater on the Tier3 jobs.

3) We had difficulties setting the site downtime. It didn't seem to match CERN and file access expectations.
Part of that was operator error (Philippe) until he started using the proper OSG wiki info.
Part of it must be due to missing topology entries (more services besides gridftp in the SE?) or to advertisement between OSG and CRIC.

 

[1] MSU CRAC problems: We replaced the control board of CRAC#2 because, although the unit was still operating, it could no longer read its temperature sensors.
CRAC#2 was relying on CRAC#1 for control, which made CRAC#1 a potential single point of total failure if it ever stopped. This scenario was deemed too dangerous, even for the couple of months we have left, so we paid for a control board replacement.
Separately, CRAC#1 turned off one of its compressors last Sunday, a repeat of what happened one month ago.
These concerns will disappear when we move to the MSU data center in June-July.