Incidents:

2 GGUS tickets:

1) The analysis queue was automatically excluded after its job failure rate exceeded 50%. We verified that nothing was wrong with the site; the failures were caused by the jobs themselves. We requested that the ticket be closed, but nobody closed it, and it was eventually closed automatically.

2) One worker node, which has a different hardware/software configuration for testing purposes, caused jobs to fail because its CVMFS cache space was used up. We excluded the node from HTCondor and closed the ticket. The node will not be put back online until more memory is added.

Service:

To fix a Condor quota bug that affected our Tier 3 users with small submissions, we upgraded Condor from 8.4.11 to 8.6.13 (the most recent stable release).

We did not schedule a downtime; instead, we upgraded the cluster in 4 batches. On the worker nodes, the running Condor jobs had to be retired and drained before the upgrade could be done. The overall process took about 5 days to finish.
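
As an illustration only, the batching could be planned with a small script like the one below. The report does not say how the 4 batches were chosen, so the even split, the function name split_into_batches, and the hostnames are all assumptions. The drain and upgrade steps are left as a comment because the exact commands used are not given here.

```python
from typing import List


def split_into_batches(nodes: List[str], n_batches: int) -> List[List[str]]:
    """Split nodes into n_batches contiguous groups whose sizes differ by at most one."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    base, extra = divmod(len(nodes), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches take one additional node each.
        size = base + (1 if i < extra else 0)
        batches.append(nodes[start:start + size])
        start += size
    return batches


# Hypothetical hostnames, used only for this example.
worker_nodes = [f"wn{i:03d}.example.org" for i in range(1, 11)]
batches = split_into_batches(worker_nodes, 4)

for number, batch in enumerate(batches, start=1):
    # For each batch: retire and drain the Condor jobs, upgrade, then bring
    # the nodes back before starting the next batch (site-specific steps).
    print(f"batch {number}: {len(batch)} nodes")
```

Doing one batch at a time keeps the rest of the cluster running jobs while a batch drains, which is why no downtime is needed.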

Hardware:

MSU had one of the PDUs replaced.