The PostgreSQL slave server for the dCache head node lost its hard drive. We had to rebuild the machine with two new hard disks, and we are in the process of restoring the slave database and setting up SRMwatch for dCache. We are also taking this chance to upgrade the host from SLC6 to CentOS 7 and to use ZFS to host the PostgreSQL database.
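
For reference, below is a minimal sketch of how the ZFS dataset for the PostgreSQL data directory could be created. The dataset name (tank/pgdata) and mount point are placeholders, not our actual layout; the recordsize is matched to PostgreSQL's 8 KB page size, and the other properties are common low-risk tuning choices.

    #!/usr/bin/env python3
    # Hypothetical sketch: create a ZFS dataset tuned for a PostgreSQL data directory.
    # Pool/dataset name and mount point are placeholders, not our real configuration.
    import subprocess

    def create_pg_dataset(dataset="tank/pgdata", mountpoint="/var/lib/pgsql"):
        # recordsize=8K matches PostgreSQL's page size; lz4 compression and
        # atime=off are typical settings for database datasets.
        subprocess.run(
            ["zfs", "create",
             "-o", "recordsize=8K",
             "-o", "compression=lz4",
             "-o", "atime=off",
             "-o", f"mountpoint={mountpoint}",
             dataset],
            check=True,
        )

    if __name__ == "__main__":
        create_pg_dataset()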

Some of our blade worker nodes show unusually high load (over 200) without any jobs running, and the load drops when we turn off HTCondor on the node. Some of these worker nodes have disk issues; some do not. We do not yet understand the cause. We updated a few worker nodes from HTCondor 8.4.11 to 8.6.12 and gave Brian Lin access to these nodes to help debug.

Ref ticket:

https://support.opensciencegrid.org/helpdesk/tickets/7720 

We upgraded dCache from version 4.2.6 to 4.2.14, and took the chance to upgrade the OS and firmware as well. The upgrade on some of the MSU pool nodes did not go well: after upgrading the dCache RPMs, the dCache service would not start on the pool nodes; the head node thought there was already a pool instance running on those nodes, and there were lock files left in the dCache pools. We tried various things, including updating/rebooting the ZooKeeper nodes and restarting the dCache services on the head/door nodes; what eventually fixed the problem was reinstalling the dCache RPM on the affected pool nodes.

While retiring a storage shelf from one of the UM dCache pool nodes, the wrong virtual disk was accidentally deleted. We could not recover the vdisk and therefore lost over 85K files. We opened a JIRA ticket and reported the lost files.

-Wenjing