Incidents:

On 21 April, one of our new R740xd2 dCache servers died because its daughterboard burnt out. Dell sent an onsite technician and the board was replaced within 48 hours. Before that, we submitted a JIRA ticket to declare the unavailability of the files.

Services:

We still see jobs killed due to OOM, 200 jobs per 2 weeks. This mostly happens on worker nodes with less than 2 GB/core. We are in the process of 1) adding more memory to worker nodes using retired parts and 2) disabling hyper-threading (HT) on worker nodes without spare DIMMs.

We see that 60% of the cluster is being used by analysis jobs. This might be caused by our recent reconfiguration of Condor and the gatekeeper, which was meant to balance giving enough cores to COVID-19 jobs against reducing core fragmentation in Condor. Too many analysis jobs seem to increase the job failure rate at the site.

Condor has been updated to 8.8.8.

Hardware:

Retired 20 TB of usable space from dCache to obtain spare parts for the storage enclosures that are no longer under warranty.