Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
Updates on US Tier-2 centers
05/12/2022
Updated condor from 9.0.11 to 9.0.12
Updated the Gratia probes on all gatekeepers. The Gratia probe stopped working for a day after the upgrade; it was fixed by reconfiguring, then manually restarting condor-ce and running
su - condor -c "/usr/share/gratia/htcondor-ce/condor_meter"
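The manual recovery above might look like the following on a gatekeeper (a sketch based on the minutes; the log path is an assumption and may vary by Gratia version):

```shell
# Restart the CE so the probe's condor-ce integration reloads:
systemctl restart condor-ce

# Run the HTCondor-CE Gratia probe once by hand as the condor user:
su - condor -c "/usr/share/gratia/htcondor-ce/condor_meter"

# Check that accounting records are flowing again (log path assumed):
tail -n 20 /var/log/gratia/condor_meter.log
```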
05/17/2022
We migrated the Tier2 NFS server umfs02 to a virtual machine without taking downtime. This NFS server provides the home directory for all grid users. The migration hit two problems: 1) the MSU worker nodes could not mount the new NFS server because of routing issues; we added routing rules as a workaround. 2) This NFS server also serves as the archive directory for the dCache PostgreSQL databases' hot standby replication. For one of the database servers (head01), the hot standby replication did not transition smoothly during the 20-minute window when the NFS servers were swapped, so we ended up reseeding the database from head01 to its hot standby server d-head01.
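Reseeding a PostgreSQL hot standby such as the one described above is typically done with pg_basebackup; a minimal sketch (host names are from the minutes, but the data directory, replication user, and service name are assumptions):

```shell
# On the standby (d-head01): stop postgres and clear the stale data directory.
systemctl stop postgresql
rm -rf /var/lib/pgsql/data/*   # assumed data directory

# Take a fresh base backup from the primary (head01) with streamed WAL:
pg_basebackup -h head01 -U replication -D /var/lib/pgsql/data \
    --wal-method=stream --progress --checkpoint=fast

# Re-enable standby mode (recovery.conf on PostgreSQL < 12), then restart:
systemctl start postgresql
```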
We converted all 26 remaining SL7 servers at the UM site to CentOS 7, including all of the dCache pool nodes and Lustre storage nodes.
05/21/2022
The new NFS server (virtual machine) umfs02 became unreachable; increasing its memory and CPU allocation restored the service. The site drained to 10% usage on the 21st because of this incident.
05/23/2022
Gratia on the OSG gatekeeper (gate02) stopped working for 2 days. Restarting the condor-ce service fixed it.
MSU finished installing and phasing in the 3x new VMware AMD host nodes (ordered Sept 2021).
They are still using the old direct-attach SAS storage; the last step will be to start using the new NVMe storage via iSCSI (also received from the 2021 order).
OU:
- One of our 7 xrootd storage servers is having RAID6 issues, so we copied all of its contents to the OSCER Ceph scratch space and pointed xrootd there while we re-create the RAID6 array from scratch with two new drives; we will then copy everything back. This should take a few days.
- xrootd pointed at the ceph copy seems to work fine.
- We prevented new data from being stored on that server during this maintenance.
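Repointing xrootd at the Ceph copy and blocking new writes, as described above, can both be done in the xrootd configuration; a sketch (the config file path and mount point are assumptions):

```
# /etc/xrootd/xrootd-clustered.cfg (fragment, paths hypothetical)
# Serve files from the Ceph scratch copy instead of the local RAID6 array:
oss.localroot /mnt/oscer-ceph/xrootd
# Export the namespace read-only so no new data lands here during maintenance:
all.export / r/o
```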
SWT2_CPB:
- Installing compute nodes from our purchase earlier this year (48 nodes total).
- Still awaiting delivery of WN's from the previous purchase! Dell claims it's imminent...
- Working to finalize scheduling for our remaining to-do's in Fred's list.
- The partition holding the Slurm DB filled up on 5/20 (GGUS 157319). It took a while to clean the area and remove the debris. We will implement some configuration changes to avoid a recurrence.
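One common configuration change for keeping the accounting database from filling its partition is to have slurmdbd purge old records; a sketch of the relevant slurmdbd.conf settings (the retention periods here are illustrative assumptions, not the site's actual choice):

```
# /etc/slurm/slurmdbd.conf (fragment)
# Purge old accounting records so the DB partition stays bounded:
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=6months
PurgeStepAfter=6months
PurgeSuspendAfter=6months
```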
Started looking into the Calico network configuration; Lincoln suggested which parameters to modify first to see whether that fixes the issue. While working on that, we noticed a general networking problem.
We are waiting for that to be fixed before moving forward with any network-related configuration changes on the K8s side.
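Before changing any Calico parameters, a quick sanity check of the current state can help separate Calico issues from the general networking problem mentioned above (a sketch; assumes calicoctl is installed and configured against the cluster):

```shell
# Show the configured IP pools and their settings (CIDR, NAT, IPIP/VXLAN mode):
calicoctl get ippool -o wide

# Check BGP peering / node-to-node mesh health from this node's perspective:
calicoctl node status
```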