Meeting ID: 996 1094 4232
Meeting password: 125
Updates on US Tier-2 centers
Getting ready for downtime and infrastructure update starting 8am Monday 14-Jun-2021
- Updating the topology in OSG/git to prepare downtime (thanks to Brian, Ofer and Mark)
- UM will finish replacing all the main switches and all cabling (fiber where possible), and will update the configuration.
Some servers and worker nodes will need to be relocated.
- MSU will be moving all services and dCache storage to the MSU Data Center Monday-Tuesday (Wave 1)
to coincide with the UM downtime.
Our public and private networks are now extended from our old EX9208 to our two new QFX5120s at the DC.
Two nodes (Wave 0) were moved this Monday to iron out networking issues.
Some multicast (for Ganglia) and stability issues were discovered and fixed.
The T2 WNs (and the MSU T3) will be moved over time (Wave 2, etc).
Moving the last set of worker nodes will need to be synchronized with the move
of the department servers sharing the same cooling (otherwise the CRACs would fail because the return air would be too cold).
- The UM-MSU link will unfortunately not be switched over to the new State Research Corridor Triangle at this time.
The MSU multi-100G Research Network will also not be ready for cut-over until at least July.
- Optimistically, we may have dCache back on Wednesday.
UC - Loss of IPv6 connectivity to PanDA took the site offline from last Thursday until Sunday. On Monday, IPv6 connectivity from IU/UIUC to UC was lost.
IU - New head node and perfSONAR servers are racked and ready to be brought online. Squid service was degraded because of an expired k8s certificate on iut2-slate.
UIUC - working towards adding ICC's HTC resources.
Dealing with an issue right now:
GPFS is getting overloaded, so HammerCloud bounced us a couple of times yesterday. Currently ~30% DDM failures. Having to reboot one of our GridFTP endpoints. Possibly caused by DAOD physics validation jobs?
xrd 5.2.0 with clustering and custom containers is working for the GPFS storage.
Preparing to buy worker nodes.
The pilot is failing to create log files at NERSC for a small percentage of jobs. This appears to block stage-out for the entire job; I am investigating. I have been in communication with Tadashi regarding the stager. It is still not understood why the pilot is failing to create a log file. I will also email Paul if there are issues after updating the pilot to the latest version. I have reduced Harvester to a single queued job until I can reliably get jobs to complete again.
TACC is down for maintenance, which is taking much longer than expected.
Alarms & Alerts