Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
OSG 3.6/3.5.32 (this week)
OSG 3.6/3.5.33 (next week)
Other
Updates on US Tier-2 centers
Update
1) Both the UM and MSU sites updated both core switches to the latest JunOS v18. Reminder: during the previous update the MSU router had to be downgraded to v17 because of a memory-leak bug that caused restarts every ~6 hours.
Incidents:
1) Some BOINC jobs flooded /var/log/messages with squashfs errors on ~10 worker nodes. Also, because BOINC now uses squashfs, we are rebuilding the worker nodes with a bigger /tmp area (1 GB/core).
2) MSU router: we added an anti-spoofing filter on packet source addresses to protect against potential attacks on our DNS or NTP servers. The filter is applied to the public VLAN. At first the filter specified only the corresponding public subnet, and a sampling ping test did not spot any problem. A few hours later, however, several hundred Condor jobs lost their heartbeat and were dropped. It was then found that a fraction of pings were failing from UM to MSU (but not from MSU to UM?!) on the private subnets. The solution was to add the private subnet to the filter definition; in fact UM had to do a similar thing earlier. It is still not clear why this is needed, and it is being investigated before we move to the new routers and campus research networks. A sketch of a denser loss-sampling test is included after this list.
3) Tried to update from OSG 3.5 to OSG 3.6, which includes updates to both Condor and Condor-CE.
With Condor, we hit an authentication problem after updating the head node from 8.8.12 to 8.9.11 (the worker nodes were updated to 8.9.11 a month ago). We could not resolve the authentication issue after trying different configurations, so we rolled the Condor head node back to 8.8.12.
With Condor-CE, after the update the CE could not receive any new jobs due to authentication failures. After many rounds of reconfiguration we still could not resolve the issue, so we rolled back the update, but the authentication failures persisted for ATLAS jobs (OSG jobs worked). This is possibly caused by a bug in Condor-CE: the ATLAS jobs kept using an old security session, Condor-CE rejected it and sent back a rejection message, but that message was blocked by the CERN firewall, so the ATLAS jobs never received the rejection. Condor-CE started receiving new jobs again after ~12 hours, once the security session expired on the ATLAS side.
4) Update on the "pilot error 1151" problem: finished resolving the few remaining errors after rebooting the dCache pool node msufs04, which had ~5k hung processes that were likely locking memory resources.
5) Still debugging transfer failures with CNAF: about 60% of transfers fail, and curl tests show ~6% packet loss. This is very likely a network issue between the two sites, but it is not yet clear in which segment (see the loss-sampling sketch after this list).
6) We have one blackhole worker node (newly built; something is not working with its cfengine policies) that failed over 1000 jobs in one day due to missing software (see the detection sketch after this list).
7) Building the C6420 worker nodes with a big /tmp (100 GB) to run BOINC jobs. BOINC jobs recently changed to use squashfs, which requires an extra 1 GB per job slot.
8) Firmware and software have been updated on all the worker nodes at MSU.
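Related to items 2) and 5): a minimal Python sketch of a loss-sampling test that pings hosts on each subnet with enough samples to resolve loss at the few-percent level, the kind of intermittent loss a quick sampling ping can miss. The host lists, sample count, and threshold are hypothetical placeholders, not the actual tests run at the sites.

```python
#!/usr/bin/env python3
"""Hedged sketch: sample ICMP loss toward hosts on each subnet.

The target lists, sample count, and alert threshold below are hypothetical;
the point is only that ~200 samples per target can reveal a few-percent loss
rate that a handful of test pings would likely miss.
"""
import subprocess

# Hypothetical targets on the private and public subnets.
TARGETS = {
    "private": ["10.10.1.10", "10.10.2.10"],
    "public":  ["192.0.2.10", "192.0.2.20"],
}
COUNT = 200  # samples per target; resolves loss at the few-percent level


def loss_percent(host: str, count: int = COUNT) -> float:
    """Ping `host` `count` times and return the reported packet-loss percentage."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-i", "0.2", "-q", host],
        capture_output=True, text=True, check=False,
    ).stdout
    for line in out.splitlines():
        if "packet loss" in line:
            # e.g. "200 packets transmitted, 188 received, 6% packet loss, time ..."
            return float(line.split("%")[0].split()[-1])
    return 100.0  # no summary line: treat as fully unreachable


if __name__ == "__main__":
    for vlan, hosts in TARGETS.items():
        for host in hosts:
            loss = loss_percent(host)
            flag = "  <-- investigate" if loss > 1.0 else ""
            print(f"{vlan:8s} {host:15s} loss={loss:5.1f}%{flag}")
```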
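Related to item 6): a minimal sketch of how a blackhole worker node could be flagged from a per-job failure log. The log format, file name, and threshold are assumptions for illustration only, not the sites' actual monitoring; the idea is simply that a node accumulating over 1000 failures in a day with the same error stands out immediately in such a count.

```python
#!/usr/bin/env python3
"""Hedged sketch: flag candidate blackhole worker nodes.

Assumes a hypothetical failure log with one "<hostname> <error>" pair per
line; file name and threshold are illustrative only.
"""
from collections import Counter

FAILURE_LOG = "failed_jobs_today.log"  # hypothetical input file
THRESHOLD = 100                        # failures/day per node worth a look


def blackhole_candidates(path: str = FAILURE_LOG) -> list[tuple[str, int]]:
    """Return (hostname, failure_count) pairs above THRESHOLD, worst first."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if parts:                  # first column: worker-node hostname
                counts[parts[0]] += 1
    return [(h, n) for h, n in counts.most_common() if n >= THRESHOLD]


if __name__ == "__main__":
    for host, n in blackhole_candidates():
        print(f"{host}: {n} failed jobs today -- check cfengine policies / software areas")
```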
Upgrading dCache to 6.2.15
Upgraded the Elasticsearch cluster to 7.11.1
Added a SLATE squid at IU for failover. MWT2 is now running two SLATE squids
Work continues on the re-location of the MWT2_UC data center; quotes for the RFQ swing equipment have been received and are being discussed.
We had a period of nearly zero GPFS use and close to 100% NESE use, which caused ~20% of jobs to fail on stage-out. A likely solution is to add a few more NESE gateways for NET2 traffic.
We're going to go through and update our OSG registration.
Smooth operations other than that.
Working on xrd via OSG 3.6, preparing for semi-production testing of that setup, at first on the BU side.
OU:
- Nothing to report, running well.
- OU is about to switch over to the new SRR space reporting.
- Possibly maintenance downtime next Wednesday for OSCER cluster updates.
UTA:
Working fine.