On OSG 3.6, for gatekeepers and worker nodes.
We broke frontier squids while trying to fix gratia probe problems.
Our first fix attempt inadvertently re-enabled a local setup script that overrides the squid location variables.
Gratia issues solved: directory ownership was root instead of condor.
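An ownership problem like this one can be spotted with a short check. This is a minimal sketch, not the procedure we actually used: the function names find_wrong_owner and uid_for are hypothetical, and the directory to scan is left as a parameter because the affected path is not recorded in these notes.

```python
import os
import pwd


def uid_for(user):
    """Look up the numeric uid for a user name (e.g. "condor")."""
    return pwd.getpwnam(user).pw_uid


def find_wrong_owner(root, expected_uid):
    """Return every path at or below root whose owner uid is not expected_uid."""
    mismatched = []
    # Check the top-level directory itself first.
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    # Then walk everything beneath it; lstat avoids following symlinks.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched
```

Calling find_wrong_owner(some_dir, uid_for("condor")) would list anything under some_dir not owned by condor; an empty list means the ownership is as expected.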
2 tickets:
156868 15-Apr-2022 AGLT2: Failing jobs in panda with "Unable to identify specific exception"
156873 17-Apr-2022 US AGLT2: High Transfer failures as source
The job problems were traced to timeouts during stage-out.
No clear cause was found, but the likely suspect was dCache (Java) running out of memory.
We increased the memory for webdav on the doors and dCacheDomain on the headnodes.
We also added CPUs and memory to the VM doors. All of this helped.
We also upgraded dCache from 6.2.35 to 7.2.15 (since we had to restart to load new CA certs anyway).
The issues from both tickets disappeared after that.
Maintenance:
Mostly finished updating all worker nodes: new kernel, Dell firmware updates, OSG updates (cvmfs).
Network upgrades completed and tested:
All new multi-path and multi-100G connections to ESnet and between MSU and UM are now fully deployed
and were tested for proper failover in case of a backhoe vs fiber incident.