Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
Updates on US Tier-2 centers
Incidents
05/25/2022
Second instance of the umfs02 NFS server VM problem (the server is used for osghome and our management files)
It became inaccessible again with the same error:
“NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s!”
Migrated it to a different VMware host
05/27
Trouble with the NFS server umfs02 again
Now suspecting latency from the VMware iSCSI storage (TrueNAS), so moved the VM to local NVMe storage
Unfortunately, this problem causes an apparent high "load", and BOINC job control throttles back in response
06/02
umfs02 in trouble again
Realized we were missing the NFS server tuning after the transition from physical host to VM
(in /etc/nfs.conf, changed threads from the default of 8 to 512).
Problem solved
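A minimal sketch of how the tuning above could be checked after a migration, assuming the setting lives as `threads` in the `[nfsd]` section of /etc/nfs.conf. The helper name `nfsd_threads` and the sample text are hypothetical; the default of 8 is the value quoted in these notes.

```python
import configparser


def nfsd_threads(conf_text: str, default: int = 8) -> int:
    """Return the nfsd thread count found in nfs.conf-style text.

    Falls back to `default` (8, the value quoted in the notes) when the
    [nfsd] section or its `threads` option is absent or commented out.
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.getint("nfsd", "threads", fallback=default)


# Hypothetical file content illustrating the change described above.
sample = """
[nfsd]
# threads=8
threads=512
"""

print(nfsd_threads(sample))
```

On a real server the text would come from reading /etc/nfs.conf; a result still equal to the default after a physical-to-VM move would suggest the tuning was not carried over.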
Hardware
06/03
UM site received the 10 R6525 worker nodes ordered in Sep 2021;
the nodes were racked, cabled, labeled, provisioned, and put into production within 2 days.
Software
06/07
Updated dCache from 7.2.15 to 7.2.16,
and also updated the kernel and firmware (rebooted to install the BIOS updates).
The process went well.
06/08
Updated condor from 9.0.12 to 9.0.13 from the OSG testing repository.
This causes an automatic rolling drain and condor restart.
We also set condor to drain and wait on the C6420 worker nodes, so that we could reboot them to apply the new BIOS updates.
Updated HTCondor-CE from 5.1.3 to 5.1.5 from the OSG testing repository
on the test gatekeeper gate04.
UTA:
OSG 3.6
IPv6
GridFTP/LSM
OU:
- Still working with Dell to get the RAID6 array on cstore13 fixed. In the meantime, xrootd is working fine using the copy of that data server's data on the ourdisk ceph scratch partition.
Last week the cluster network was still misbehaving, and at the end of the week Patrick replaced the switch that was locking up. That resolved the issue. In the Calico network configuration I modified IP_AUTODETECTION_METHOD, which was a possible suspect, and the system reported that it was updated. The process recreated the Calico pod for that node, but it is not clear that it did the trick (something could have overridden the parameter); at least it didn't resolve the connectivity issue for that Calico pod.
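One way to check whether the IP_AUTODETECTION_METHOD change actually persisted is to read the variable back from the DaemonSet definition. The sketch below assumes the JSON comes from something like `kubectl get daemonset calico-node -o json`; the DaemonSet and container name `calico-node`, the helper `daemonset_env`, and the value `interface=eth0` are illustrative assumptions, not details from this report.

```python
import json


def daemonset_env(ds: dict, container: str, name: str):
    """Return the value of env var `name` in `container`, or None if unset."""
    for c in ds["spec"]["template"]["spec"]["containers"]:
        if c.get("name") != container:
            continue
        for env in c.get("env", []):
            if env.get("name") == name:
                return env.get("value")
    return None


# Hypothetical, trimmed-down DaemonSet JSON for illustration only.
sample_json = """
{
  "spec": {"template": {"spec": {"containers": [
    {"name": "calico-node",
     "env": [{"name": "IP_AUTODETECTION_METHOD", "value": "interface=eth0"}]}
  ]}}}
}
"""

ds = json.loads(sample_json)
print(daemonset_env(ds, "calico-node", "IP_AUTODETECTION_METHOD"))
```

If the value read back differs from what was set, that would be consistent with something having overridden the parameter.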