05/12/2022

Updated condor from 9.0.11 to 9.0.12

Updated the Gratia probes on all gatekeepers. The Gratia probe stopped working for a day after the upgrade. It was fixed by reconfiguring, then manually restarting condor-ce and running:

su - condor -c "/usr/share/gratia/htcondor-ce/condor_meter"
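
The recovery steps above can be collected into a small helper, shown here as a sketch rather than the exact procedure used. It assumes that "reconfigure" means reloading the HTCondor-CE configuration with condor_ce_reconfig, and that the service is managed by systemd. The names run and recover_gratia_probe are made up for illustration. By default the commands are only printed.

```shell
#!/bin/bash
# Sketch of the Gratia probe recovery steps from this log entry.
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_gratia_probe() {
    # Assumption: "reconfigure" means reloading the HTCondor-CE configuration.
    run condor_ce_reconfig || return 1
    # Restart the CE service (assumes a systemd-managed condor-ce unit).
    run systemctl restart condor-ce || return 1
    # Run the probe once as the condor user (command taken from this entry).
    run su - condor -c "/usr/share/gratia/htcondor-ce/condor_meter"
}
```

Calling recover_gratia_probe prints the three commands in order. Running it with DRY_RUN=0 as root on a gatekeeper would execute them and stop at the first failure.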

 

05/17/2022

We migrated the Tier2 NFS server umfs02 to a virtual machine without taking a downtime. This NFS server provides the home directories for all grid users. The migration hit two problems: 1) The MSU worker nodes could not mount the new NFS server because of routing issues; we added routing rules as a workaround. 2) This NFS server also holds the archive directory for the hot standby replication of the dCache PostgreSQL databases. For one of the database servers (head01), the hot standby replication did not transition smoothly during the 20-minute downtime while the NFS servers were being swapped, so we ended up reseeding the database from head01 to its hot standby server d-head01.
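
For reference, one common way to reseed a PostgreSQL hot standby is pg_basebackup, run on the standby host. The sketch below is not necessarily the method used here. The data directory, replication user, and service name are hypothetical placeholders, and only the primary host name head01 comes from this entry. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Sketch only: reseed a PostgreSQL hot standby from its primary with pg_basebackup.
# Hypothetical values (not from this log): STANDBY_DATA_DIR, REPL_USER, PG_SERVICE.
DRY_RUN="${DRY_RUN:-1}"                                      # 1 = print, 0 = execute
PRIMARY_HOST="${PRIMARY_HOST:-head01}"                       # primary named in this entry
STANDBY_DATA_DIR="${STANDBY_DATA_DIR:-/var/lib/pgsql/data}"  # hypothetical path
REPL_USER="${REPL_USER:-replicator}"                         # hypothetical role
PG_SERVICE="${PG_SERVICE:-postgresql}"                       # hypothetical unit name

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reseed_standby() {
    # Stop the standby before touching its data directory.
    run systemctl stop "$PG_SERVICE" || return 1
    # Keep the old data directory until the new standby is verified.
    run mv "$STANDBY_DATA_DIR" "${STANDBY_DATA_DIR}.old" || return 1
    # -X stream copies WAL during the backup, -R writes standby settings,
    # -P reports progress.
    run sudo -u postgres pg_basebackup -h "$PRIMARY_HOST" -U "$REPL_USER" \
        -D "$STANDBY_DATA_DIR" -X stream -P -R || return 1
    run systemctl start "$PG_SERVICE"
}
```

With archive-based replication, as in this entry, the standby would also need its restore_command pointing at the archive directory on the NFS server; that setting is site-specific and is left out of the sketch.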

 

We converted all 26 remaining SL7 servers at the UM site to CentOS 7. This includes all the dCache pool nodes and Lustre storage nodes.

 

05/21/2022

The new NFS server umfs02 (a virtual machine) became inaccessible; increasing its memory and CPU restored the service. The site drained to 10% usage on the 21st because of this incident.

 

05/23/2022

Gratia on the OSG gatekeeper (gate02) stopped working for 2 days. Restarting the condor-ce service fixed it.

MSU finished installing and phasing in the 3 new VMware AMD host nodes (ordered Sept 2021).
They are still using the old direct-attached SAS storage. The last step will be to start using the new NVMe storage via iSCSI (also received from the 2021 order).