Hardware:
The MSU site installed 3x R740xd2 and is now benchmarking different stripe sizes for the RAID6.
Service:
Updated dCache from 6.2.29 to 6.2.32 to address security issues. The update went smoothly.
HS06 benchmark:
We ran it on two types of CPUs (Intel(R) Xeon(R) Gold 6240R and Intel(R) Xeon(R) CPU E5-2650 v2), with two findings: 1) the new HS06 score is 6-8% higher than the old number, measured with the same benchmark toolkit but a different kernel and firmware. 2) We compared the HS06 score with and without BOINC jobs running in the background, and having BOINC reduces the score by 1.32-2
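The comparisons above come down to a relative change between two HS06 scores. A minimal sketch of that arithmetic follows; the function name `percent_change` and all scores in the example are hypothetical placeholders, not measured values from this report.

```python
def percent_change(old_score: float, new_score: float) -> float:
    """Return the relative change from old_score to new_score, in percent.

    Positive means the new score is higher; negative means it is lower.
    """
    if old_score <= 0:
        raise ValueError("old_score must be positive")
    return (new_score - old_score) / old_score * 100.0


if __name__ == "__main__":
    # Hypothetical scores, for illustration only (not measured values).
    old_hs06 = 100.0   # assumed earlier HS06 result for a node
    new_hs06 = 107.0   # assumed result after kernel/firmware changes
    print(f"new vs old: {percent_change(old_hs06, new_hs06):+.1f}%")

    idle_hs06 = 200.0        # assumed score with no BOINC jobs running
    with_boinc_hs06 = 197.0  # assumed score with BOINC in the background
    print(f"BOINC impact: {percent_change(idle_hs06, with_boinc_hs06):+.1f}%")
```

The same function covers both findings: the old-versus-new comparison and the with-versus-without-BOINC comparison differ only in which pair of scores is passed in.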
Incidents:
The removal of the DBRelease file by ADC caused BOINC jobs to fail and led to mis-accounting.
The Xcache server sl-um-es4 crashed because one disk failed. We replaced the disk (RAID 0) and recreated the instance for Xcache.
On 11/7/2021, starting at 3am UTC, SrmManager on the dCache head node (head01) started to fail. This caused low file transfer efficiency (60% failure) and rucio deletion failures (datadisk was almost full). The AGLT2 PanDA queue was also set to test status because over 60% of the jobs failed, which drained the site to only 20% job slot usage. We first restarted dCache on the head node, which fixed the rucio deletion failures. Transfer efficiency remained low, and the problem was traced to one xrootd door (dcdmsu01) and one pool node (msufs04). We restarted dCache on both nodes and the transfer efficiency started to improve, but we then saw more errors from other pool nodes, so we ended up restarting dCache on all nodes, which eventually solved all the problems.