Site blacklisted on Saturday 12-Jul
Both DATADISK and SCRATCHDISK filled with dark data.
Post-incident understanding:
dCache space reporting or file deletion was not working.
On 11-Jul Rucio/DDM saw space usage exceeding the threshold and started deleting cached files, but the reported usage did not go down.
That deleted-but-not-reported space became ~2.5PB of "dark data".
Resolution on Mon 14-Jul:
Restarted dCache and added 0.5PB
Dark data disappeared shortly after
Cause:
Suspect some dCache service had not been working since the return from the 8-Jul downtime for street water works.
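The dark-data accounting described above can be sketched as follows. This is a minimal illustration only, not the site's or Rucio's actual tooling: the function name `dark_data_bytes` and the input values are hypothetical, picked so that the difference matches the ~2.5PB figure.

```python
TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes

def dark_data_bytes(storage_used: int, catalog_used: int) -> int:
    """Space the storage reports as used that the catalog cannot account for.

    A negative difference (catalog larger than storage) is clamped to 0,
    because that case means missing files rather than dark data.
    """
    return max(storage_used - catalog_used, 0)

# Hypothetical inputs for illustration only (not measured values).
storage_used = 10_000 * TB  # what the storage system reports as used
catalog_used = 7_500 * TB   # what the catalog still believes exists

dark = dark_data_bytes(storage_used, catalog_used)
print(f"dark data: {dark / PB:.1f} PB")
```

If deletions succeed but the storage side keeps reporting the old usage, the first argument stays high while the second shrinks, so the difference grows with every deletion round.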
Issue with A/R tests on gate01 started 17-Jul and resolved 18-Jul
First suspected a problem with the certificate update, but that was not the root cause.
Updated software and rebooted. Recent ETF jobs were in state I=idle in condor-ce (not R=running or C=completed).
Found that the condor-ce service had restarted on 16-Jul at 16:49 and that the condor service was not running.
Enabling and starting the condor service let the ETF jobs reach the running state, which resolved the A/R issue.
The cause is not clear, but Ansible had started a run right before that time which didn’t seem to complete.
A/R tests were failing but the site was otherwise working. Will need to request a correction.
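The diagnosis above hinged on test jobs sitting idle without any reaching the running or completed state. A minimal sketch of that check, assuming the job states have already been collected as the single-letter codes mentioned above (I, R, C). The function `looks_stuck` and the sample inputs are hypothetical, and the sketch does not query condor-ce itself.

```python
from collections import Counter

def looks_stuck(states: list[str]) -> bool:
    """Return True when some jobs are idle and none are running or completed."""
    counts = Counter(states)  # missing codes count as 0
    return counts["I"] > 0 and counts["R"] == 0 and counts["C"] == 0

# Hypothetical samples, for illustration only.
print(looks_stuck(["I", "I", "I"]))  # all idle: suspicious
print(looks_stuck(["I", "R", "C"]))  # jobs are progressing
```

A queue where every recent job is idle points at the batch system behind the CE rather than at the tests themselves, which matches what was found here.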
EL9 at MSU:
All issues with RedHat Satellite have been resolved.
All FY24 equipment in production since 14-Jul
Issue with incomplete dCache pool draining (with pinned and locked files) resolved.
Remaining files were from the Dec 6-10 database loss problem.
Un-pinning (rep set sticky -all off) allowed sweeper to purge them.
Status as of today for storage:
78% of nodes / 84% of space already upgraded; should finish today.
Then will continue migrating data off the EL7 storage that is to be decommissioned.
Status as of today for compute:
69% of nodes / 80% of HEPspecs already upgraded; should essentially finish this week.
One set of 8x R6525s needs to be moved and re-addressed to work around a local airflow problem.
Then the remaining old EL7 WNs will be decommissioned.