<p>Incidents:</p>

<p>On 21st April, one of our new R740xd2 dCache servers died because its daughterboard was burnt. Dell sent an onsite technician and the board was replaced within 48 hours. Before that, we submitted a JIRA ticket to declare the unavailability of the files.</p>

<p>Services:</p>

<p>We still see jobs getting killed due to OOM, about 200 jobs per 2 weeks. This mostly happens on worker nodes with less than 2 GB/core. We are in the process of 1) adding more memory to worker nodes using DIMMs from retired hardware, and 2) disabling hyper-threading (HT) on worker nodes for which we have no spare DIMMs.</p>

<p>We see that 60% of the cluster is being used by analysis jobs. This might be caused by our recent reconfiguration of Condor and the gatekeeper, which was meant to balance giving enough cores to COVID-19 jobs against keeping fragmentation of Condor cores low. Too many analysis jobs seem to increase the job failure rate at the site.</p>

<p>Condor has been updated to 8.8.8.</p>


<p>Hardware:</p>

<p>Retired 20 TB of usable space from dCache to obtain spare parts for the storage enclosures that are no longer under warranty.</p>
