Stable running, overall.
1. 24-Aug: ~250 jobs failed with stage-out errors.
This was caused by 3 worker nodes with IPv6 issues: they could ping the gateway, but not some dCache servers.
We added them to the set of nodes kept offline for IPv6 issues; hopefully this can be resolved once
the Shinano border switch is removed.
2. 30-Aug: sites started to drain because of accumulated transferring jobs (3750, exceeding the limit of 3000 set in CRIC).
The accumulated transfers are destined for the Napoli site, which is currently in unscheduled downtime.
The transferring limit was raised to 4000, and jobs are slowly ramping up.
3. We noticed frequent worker node crashes caused by BOINC jobs (squashfs errors flooding /var/log/messages).
As a workaround, we redirect the squashfs messages to a separate log file and run logrotate more often.
ATLAS@home also released a new version, which does not seem to solve the problem.
We are also testing the removal of squashfs/singularity from worker nodes,
to force BOINC jobs to use the CVMFS singularity image.
4. MSU site migration to campus data center complete (T2 and T3).
Now 2x100G to Chicago and ESnet.
Old room in dept building emptied, now used to test cables for IceCube.
Will ship EX9208 parts to UC.
Issue with "Export Control" understood.
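
The IPv6 symptom in item 1 (gateway reachable, some dCache servers not) can be checked per node before it is returned to service. The following is a minimal sketch, not the procedure actually used at the site: the function names, the status labels, and the idea of probing over TCP are all assumptions, and the host names and port in the usage note are placeholders.

```python
import socket


def ipv6_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds over IPv6.

    Hypothetical helper: only IPv6 addresses are tried, so a host that is
    reachable over IPv4 alone still counts as unreachable here.
    """
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no IPv6 address found, or name resolution failed
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next resolved address, if any
    return False


def classify(gateway_ok, server_results):
    """Summarize probe results for one worker node.

    gateway_ok:     bool, e.g. the outcome of a separate ping to the gateway
    server_results: dict mapping storage host name -> bool (reachable or not)
    Returns (status, sorted list of unreachable hosts). The status labels
    are illustrative only.
    """
    failed = sorted(h for h, ok in server_results.items() if not ok)
    if not gateway_ok:
        return "no-ipv6", failed
    if failed:
        return "partial-ipv6", failed
    return "ok", failed
```

A node would be probed with something like `classify(True, {h: ipv6_reachable(h, port) for h in servers})`, where `servers` and `port` stand for the site's own dCache door hosts and port. A "partial-ipv6" result matches the failure seen on 24-Aug, which a gateway ping alone would not reveal.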