Incidents

1) The gatekeeper that receives the HC jobs stopped working. We had no monitoring for the condor-ce service, so we did not notice right away.

2) A large fraction of analysis jobs failed, and the site received a GGUS ticket. We found that a script we use to clean up zombie files left behind by jobs killed by HTCondor was also accidentally deleting the work directories of running jobs. The bug was introduced in the script with the switch to pilot2. We have fixed the bug in our script.
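
For incident 2, the kind of guard that prevents this failure can be sketched as below. This is a minimal illustration, not our actual script: the function names, the one-directory-per-job layout, and the age threshold are all assumptions, and the list of active work directories is assumed to be obtained from the batch system by the caller (not shown).

```python
import shutil
import time
from pathlib import Path


def find_stale_workdirs(base_dir, active_dirs, min_age_seconds, now=None):
    """Return work directories under base_dir that look safe to remove.

    Hypothetical layout: one sub-directory per job directly under base_dir.
    active_dirs is the list of directories that belong to running jobs.
    """
    now = time.time() if now is None else now
    active = {Path(p).resolve() for p in active_dirs}
    stale = []
    for entry in sorted(Path(base_dir).iterdir()):
        if entry.is_symlink() or not entry.is_dir():
            continue  # only consider real directories
        if entry.resolve() in active:
            continue  # belongs to a running job: never touch it
        if now - entry.stat().st_mtime < min_age_seconds:
            continue  # too recent: may be a job that has just started
        stale.append(entry)
    return stale


def remove_stale_workdirs(base_dir, active_dirs, min_age_seconds, dry_run=True):
    """Remove stale work directories; with dry_run=True only report them."""
    stale = find_stale_workdirs(base_dir, active_dirs, min_age_seconds)
    for path in stale:
        if not dry_run:
            shutil.rmtree(path, ignore_errors=True)
    return stale
```

Running with dry_run=True first and reviewing the returned list is a cheap way to catch a mismatch between the directory layout the script expects and the one the pilot actually uses.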

Hardware

Sorted out which storage servers we can retire from the Tier2, based on the age of the hardware and the number of failures it has had. Worked out the items to purchase.

 

Service

Set up a new replication server for the dCache database.