Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
There is a WLCG K8s meeting being organized for June 7: https://indico.cern.ch/event/1096043/
3.6 release planned for tomorrow, contains:
Pending testing, we may also release osg-scitokens-mapfile 8 (for 3.5 and 3.6).
Other software ready for testing:
Working on improving documentation for updating to OSG 3.6, especially for xrootd / xcache services.
Updates on US Tier-2 centers
All gatekeepers are updated to OSG 3.6 (HTCondorCEVersion: 5.1.3)
Working on a potential issue with the gratia-probe from the new-style built-in HTCondor-CE probe.
All worker nodes are updated to HTCondor 9.0.11
but gate01/04/aglbatch are still on 9.0.10
UM still waiting on worker nodes from Fall 2021 order (R6525 AMD Rome)
UM and MSU waiting on January 2022 order (R6525 AMD Milan), with some or all now delayed to June 9
(all storage has been received and installed)
6-Apr: noticed a stage-out problem; traced to one dCache door (restarted all doors)
Tested OSG 3.6 a while back and are currently running it on an opportunistic queue. Will update the site in the coming weeks.
Rolling out kernel updates to fix security vulnerabilities.
UC:
IU:
UIUC:
Operational issues:
A site-level cooling issue at MGHPCC took NESE Ceph offline for about an hour.
GPFS disk errors in the system pool required evacuating it; SGE capacity was reduced while this was in progress.
88 worker nodes installed and tested, all but 1 rack in production.
Rack of contributed nodes arrived.
100 Gb optics are hard to obtain from Dell.
New NESE Ceph storage equipment arrived except for Cisco switches, expected in August following long delays.
NESE Tape commissioning: the 50 PB pool is being reformatted today due to an IBM firmware issue.
Working with NESE team on NESE Tape expansion.
Quarterly reports in. Hardware spreadsheet updated.
Run 3 software prep: Working on perfsonar & ipv6. OSG 3.6 to follow.
UTA:
UTA_SWT2 is officially shut down and the equipment has been moved back to campus; work is ongoing to move compute nodes into the SWT2_CPB and K8s clusters.
Existing CE (gk01.atlas-swt2.org) will move to new hardware and OSG 3.6. The last existing job should drain today.
Investigating GRACC reporting issue related to two gatekeepers.
OU:
A new gatekeeper and SLATE squid are about to be installed and will be ready for testing.
Found a workaround for xrootd5 transfer failures seen with the new emi/rucio ALRB tests. xrootd5 client operations from compute nodes against the newly upgraded xrootd5 backend storage insist on using TLS; since the OSG CE propagates X509_USER_CERT and X509_USER_KEY into the wn-client environment, and the hostcert/key files do not exist there, TLS fails. Unsetting X509_USER_CERT and X509_USER_KEY in atlas_app/local/setup.sh.local prevents these failures.
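A minimal sketch of that workaround as it would appear in atlas_app/local/setup.sh.local (the variable names are the standard X509 ones the CE propagates; the rest of the file's contents are site-specific and omitted here):

```shell
# Sketch of an atlas_app/local/setup.sh.local addition.
# The OSG CE propagates X509_USER_CERT/X509_USER_KEY pointing at hostcert/key
# files that do not exist on the worker nodes, which makes the xrootd5 TLS
# handshake fail; unsetting them lets the client fall back to its default
# credential search.
unset X509_USER_CERT
unset X509_USER_KEY
```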
The startup K8s cluster works fine. I created a queue for it in CRIC (SWT2_CPB_K8S), which was later used for tests.
After that I created a Harvester service account in the K8s cluster and generated a kubeconfig file, used on the Harvester side to communicate with the cluster. Patrick reconfigured the firewall, but there was still a communication issue at first. This was tracked down to the initial setup of the cluster: by default, the kubeadm init step picked up only the private IP address of the control plane. I regenerated the API server certificate to also include the public IP address and did the related reconfiguration, after which communication with the cluster was established. Fernando then managed to submit several grid test jobs; they reached the workers but got stuck there in a waiting state, which we are looking into now.
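The certificate fix above can be sketched with openssl. The IP addresses below are hypothetical placeholders for the private and public control-plane addresses, and the kubeadm regeneration commands (standard `kubeadm init phase certs` usage) are shown as comments since they require a live control plane:

```shell
# Illustrative only: 192.168.1.10 stands in for the private control-plane IP,
# 203.0.113.5 for the public one (both hypothetical).
# Generate a self-signed cert whose SANs include both IPs, mimicking what the
# regenerated API server certificate must contain:
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/apiserver.key -out /tmp/apiserver.crt -days 1 \
  -subj "/CN=kube-apiserver" \
  -addext "subjectAltName=IP:192.168.1.10,IP:203.0.113.5"

# Verify both IPs appear in the Subject Alternative Name extension:
openssl x509 -in /tmp/apiserver.crt -noout -text | grep "IP Address"

# On the actual control plane, the regeneration would be roughly:
#   rm /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/apiserver.key
#   kubeadm init phase certs apiserver --apiserver-cert-extra-sans=203.0.113.5
```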
On the hardware side, our admins will start adding nodes to the K8s cluster from the UTA_SWT2 equipment that arrived last week, and we will probably also update some of the existing worker nodes.