09/16/2021

MSU Dell PO issued.
We are missing the information needed to find the order on the Dell website.
Asked the Dell reps, but no success yet.
So far we have only found the WN part, with an estimate of 18-Jan-2022.

09/17/2021

Allowed all smaller worker nodes to run BOINC (using a larger swap file)
after a long test period in which we measured and verified a low impact
on ATLAS jobs and node stability.

09/20/2021

One of the dCache pools (umfs11_6) became disabled again (twice in 2 weeks).
We repaired the file system first, then started the pool.
The disabled pool caused 110 jobs to fail while staging out files.

Finally we decided to retire this pool and another pool on the same host
because each had unresponsive disks and disks with pending failures,
which we are not planning to replace anymore.
(This whole storage node was already targeted for retirement
as soon as we get our new storage nodes, now estimated Jan 2022.)
With some struggle (the pool would disable itself during draining),
we finally drained and retired the pool umfs11_6.

Eventually we found that over 11K files were lost during the xfs_repair.
We declared the lost files in JIRA ticket ATLDDMOPS-5575 on 09/29/2021.

09/22/2021

Updated dCache from 6.2.25 to 6.2.29 (for new SRR support).
We also applied system firmware and software (including kernel) updates and rebooted all dCache servers.
Two dCache storage nodes (umfs11 and umfs19) had corrupted GRUB configuration files;
we had to mount an ISO file to recover the GRUB file.

09/22/2021

Also applied new firmware and kernel updates on worker nodes,
draining and rebooting the nodes in batches.
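
A minimal sketch of how a batched drain and reboot could be planned, so that
only a fraction of the worker nodes is out of service at any time. This is
not the script actually used at the site: the node names, the batch size and
the helper name batches() are hypothetical, and the drain, reboot and
re-enable steps are only printed, not executed.

```python
from typing import Iterator, List


def batches(nodes: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `nodes`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]


# Hypothetical node names and batch size, for illustration only.
worker_nodes = [f"wn{i:02d}" for i in range(1, 8)]
plan = list(batches(worker_nodes, 3))

for number, batch in enumerate(plan, start=1):
    # A real run would drain the batch, wait for running jobs to finish,
    # reboot, and re-enable the nodes before moving to the next batch.
    print(f"batch {number}: drain, reboot, re-enable {', '.join(batch)}")
```

A smaller batch size keeps more job slots available during the update, at
the cost of a longer overall maintenance window.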