Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
3.5.28 (this week)
Miscellaneous
Updates on US Tier-2 centers
Site | Total | Done | Failed | Canceled | Closed | %Failed |
---|---|---|---|---|---|---|
AGLT2 (no BOINC) | 93060 | 68754 | 6751 | 210 | 4643 | 9% |
MWT2 | 276422 | 224597 | 12613 | 991 | 8422 | 5% |
NET2 | 71892 | 49350 | 12264 | 744 | 3934 | 20% |
OU_ATLAS | 8015 | 4324 | 58 | 31 | 638 | 1% |
OU_ATLAS_OPP | 3797 | 3219 | 17 | 17 | 514 | 0% |
SWT2_CPB | 97828 | 79512 | 2922 | 352 | 3818 | 3% |
UTA_SWT2 | 32183 | 26396 | 414 | 413 | 1273 | 2% |
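For cross-checking the last column: the %Failed values appear roughly consistent with Failed / (Done + Failed), i.e. failures as a fraction of finished jobs. A minimal sketch under that assumed definition (counts copied from the table above):

```python
# Sketch: recompute the %Failed column, assuming it is defined as
# Failed / (Done + Failed). This definition is an assumption, not confirmed;
# canceled and closed jobs are excluded from the denominator here.
sites = {
    # site: (done, failed)
    "AGLT2 (no BOINC)": (68754, 6751),
    "MWT2":             (224597, 12613),
    "NET2":             (49350, 12264),
    "OU_ATLAS":         (4324, 58),
    "OU_ATLAS_OPP":     (3219, 17),
    "SWT2_CPB":         (79512, 2922),
    "UTA_SWT2":         (26396, 414),
}

for site, (done, failed) in sites.items():
    pct = 100.0 * failed / (done + failed)
    print(f"{site:20s} {pct:5.1f}% failed")
```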
Incident:
One iSCSI storage device used by our VMware cluster to store VM images failed completely. About 1/3 of the VMs became unresponsive, including dCache door nodes, the HTCondor head node, and the gatekeepers, and the site remained in downtime for 1.5 days. The Dell storage was recovered without data loss, and we migrated the VM images to other storage locations.
The gatekeeper had very few incoming jobs; it recovered after we restored the iSCSI VMware storage device.
The site was flooded (over 60% of job slots) with high-memory jobs requesting 3 GB to 6 GB of RAM. Most of our worker nodes do not have that much RAM per core, so some became unresponsive due to heavy swap usage. This stemmed from a misunderstanding about the high-memory queue: we thought it was set to a 3 GB/core maxrss limit. We are still working on job routing rules to adapt to this, and have also set a limit on the number of jobs in the high-memory queue.
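To illustrate the mismatch (a hypothetical sketch, not the site's actual scheduler or job-routing configuration; the 2 GB/core and 3 GB/core figures below are assumed for illustration):

```python
# Why high-memory jobs caused swapping: a job's *total* memory request has to be
# compared against the node's RAM per core times the cores the job occupies.
def fits_without_swap(request_mem_mb, request_cpus, node_mem_per_core_mb):
    """True if the job's per-core memory request fits the node's RAM per core."""
    return request_mem_mb / max(request_cpus, 1) <= node_mem_per_core_mb

# A single-core job asking for 6 GB on a node provisioned at ~2 GB/core:
print(fits_without_swap(6144, 1, 2048))   # False -> pushed into swap
# The same 6 GB spread over 2 cores, against the 3 GB/core maxrss we assumed:
print(fits_without_swap(6144, 2, 3072))   # True
```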
Ticket
Closed ticket 149378: dCache transfer/deletion errors. The deletion errors were caused by a down dCache door node, which in turn was caused by the VM storage issue. Declared lost the files that had lost their metadata in dCache; these were missed when we summarized the files lost on 4 Oct due to the loss of 2 virtual disks.
UTA_SWT2:
May need to shrink OSG pool if fewer COVID jobs are running
SWT2_CPB:
Lingering problem with data transfers (ticket 149701). Suspect a nearly empty data server is the cause. Will re-evaluate once the server is drained.
OU:
- No site problems, running well
- Site was drained yesterday, but Rod fixed that by fudging weights
NERSC down to less than 5M MPP hours. Might not get any more time. We have been given 50M hours above our initial allocation of 104M MPP hours.
NERSC down 15-20 Dec.
TACC ramped up to 7K concurrent slots before the outage. In the last week it simulated 7.8M events.
ALCF is ramping up.
Raythena debugging continues
XCache
* Had issues with full ephemeral storage at NET2 and AGLT2.
* Agreed with Andy and Matevz on an XrootD CCM plugin for sending heartbeats from servers. Should be ready for 5.2.0.
VP
* Agreed with the Rucio folks on XCache/CRIC/Rucio/VP communication.
ServiceX
* Testing deployments on different k8s clusters.