On OSG 3.6, for gatekeepers and worker nodes.

We broke the Frontier squids while trying to fix Gratia probe problems.
Our first fix attempt inadvertently re-enabled a local setup script that overrides the squid location variables.
The Gratia issues are now solved: the directory ownership was root instead of condor.
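
An ownership problem like this one can be spotted with a short check. The following is only a sketch, not the procedure we used: the function names are made up for illustration, and the directory to scan is whatever path the probe writes to on your host (not specified here).

```python
import os
import pwd
from pathlib import Path


def uid_for(user):
    """Look up the numeric uid for a user name (e.g. "condor")."""
    return pwd.getpwnam(user).pw_uid


def misowned_paths(root, expected_uid):
    """Return every path at or below root whose owner uid differs from expected_uid."""
    root = Path(root)
    bad = []
    for path in [root, *root.rglob("*")]:
        # lstat() inspects the entry itself rather than following symlinks
        if path.lstat().st_uid != expected_uid:
            bad.append(path)
    return bad
```

For example, misowned_paths(some_directory, uid_for("condor")) lists everything under some_directory that is not owned by condor; an empty list means the ownership is consistent.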

2 tickets:

156868  15-Apr-2022   AGLT2: Failing jobs in panda with "Unable to identify specific exception"
156873  17-Apr-2022   US AGLT2: High Transfer failures as source

The job problems were traced to timeouts during stage-out.
There was no clear cause, but the likely suspect was dCache and Java running out of memory.
We increased the memory for webdav on the doors and for dCacheDomain on the headnodes.
We also added CPUs and memory to the VM doors. All of that helped.
We also upgraded dCache from 6.2.35 to 7.2.15 (since we had to restart to load new CA certs anyway).
The issues from both tickets disappeared after that.

Maintenance:

We are mostly through updating all worker nodes for the new kernel, Dell firmware updates, and OSG updates (cvmfs).

Network upgrades completed and tested:

All new multi-path and multi-100G connections to ESnet and between MSU and UM are now fully deployed
and were tested for proper failover in case of a backhoe-vs-fiber incident.