Meeting ID: 996 1094 4232
Meeting password: 125
Reminder: HPSS (tape system) downtime from 2-Aug-21 through 7:00 pm on 5-Aug-21
Updates on US Tier-2 centers
1) MSU site is moving 65 WNs to the new DC, i.e. all the newer WNs (R620s, R630s, C6420s).
2) UM site is working on the IPv6 issues on the new network. There are two sets of causes: we solved the first set by adding static IPv6 ND mappings to the gateway; we are still working on the second set, which affects the R620s whose data cables are connected to the management switches.
3) Job failures: 40% failure rate on 20th July due to 2 errors. The first, "payload metadata does not exist", disappeared on 21st July (AGLT2 has the largest number of failed jobs for this error within usatlas, but some other sites have similar errors). The second is a "no local space" error: the home directories of the usatlas users were full after years of small files piling up. We cleaned the space and set up a cronjob to keep it clean.
Details about 2):
More worker nodes are having IPv6 connectivity issues (they cannot reach the gateway). There are two sets of causes. The first is possibly a bug in either the Juniper or the Cisco border switches. The workaround is to add static IPv6 ND mappings to the Juniper gateway (we have added all worker nodes). Hopefully this will be resolved when we can get rid of the Juniper gateway (using Cisco instead) in August. The second is that the management switches (S3048) have IPv6 issues. We have ~20 R620s which need to connect to the management switches for data connections; we haven't found a solution to that yet, so we have retired condor on all R620 worker nodes for now.
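The home-directory cleanup mentioned in item 3) could be sketched roughly as follows. This is only an illustration of an age-based cleanup that a cronjob might call, not the script actually deployed at the site; the function name remove_old_files, the age threshold, and the dry-run default are all assumptions.

```python
import os
import stat
import time


def remove_old_files(root, max_age_days, now=None, dry_run=True):
    """Find regular files under root not modified for max_age_days.

    Returns the list of matching paths. Files are deleted only when
    dry_run is False. (Hypothetical helper; the threshold is an assumption.)
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    matched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except FileNotFoundError:
                continue  # file vanished while we were walking
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other special files
            if st.st_mtime < cutoff:
                if not dry_run:
                    os.remove(path)
                matched.append(path)
    return matched
```

Running it first with dry_run=True and reviewing the returned list before enabling deletion is a reasonable precaution on shared home directories.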
GGUS tickets: 0
HC blacklists: 0
o Smooth operations
o Site full except for a dip around 2021-07-15 (unknown if it's a widespread dip)
o Advanced stages of getting ready to buy worker nodes.
o xrd 5.3.0 installed and working in our custom container
o Successfully exporting NET2_DATADISK, _SCRATCHDISK, _LOCALGROUPDISK
o Endpoint atlas-xrootd.bu.edu registered in CRIC
o Configured for HTTP-TPC, custom adler32, both work successfully
o Getting put into "smoke tests" by Alessandra & co.
o Some problems remain, possibly related to transfers to dCache sites; Wei and Andy are investigating.
o NESE Tape ATLAS endpoints have arrived, expected to be racked and cabled this week.
o perfSONAR node rebuilt with new hardware, both nodes are IPv6 now.
o Annual MGHPCC power maintenance, August 9
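For reference on the adler32 checksum mentioned in the XRootD items above, a minimal sketch of how such a checksum can be computed over a file in chunks is given below. It uses Python's standard zlib.adler32 and is not the site's custom implementation; the function names and the 8-digit zero-padded lowercase hex output are assumptions about the desired format.

```python
import zlib


def adler32_hex(chunks):
    """Adler-32 over an iterable of byte chunks, as 8 hex digits."""
    value = 1  # Adler-32 starts from 1, not 0
    for chunk in chunks:
        value = zlib.adler32(chunk, value)  # running checksum
    return f"{value & 0xFFFFFFFF:08x}"


def file_adler32(path, chunk_size=1024 * 1024):
    """Adler-32 of a file, read in chunks to bound memory use."""
    with open(path, "rb") as f:
        return adler32_hex(iter(lambda: f.read(chunk_size), b""))
```

Reading in chunks gives the same result as checksumming the whole file at once, which matters for multi-gigabyte data files.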
Setting up a test host, as proxy, with test version of XRootD 5.3 from OSG. Software installed, working on configuration.
Operations mostly smooth over the period
- Smooth operations, ran low on jobs occasionally.
- XRootD 5.3.0 installed, HTTP-TPC working, waiting to be included in smoke tests.
A few more compute nodes showed up and have been racked, built, etc. and added to the cluster.
A couple more interactive machines showed up, but haven't been racked and built yet. These (along with the three machines mentioned above) aren't necessary for us to go into production.
Still waiting on the GPU machine. We believe sometime in November is when it will arrive (according to Dell).
We've gotten a condor queue up and running. We can submit jobs from both submit hosts we're planning to have for users on day 1.