US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00 → 13:10
WBS 2.3 Facility Management News 10m Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
Proposed milestones to be added by COB Friday: https://docs.google.com/spreadsheets/d/1CF5nSKi2UWiiF4hJpLbJIba_A-2aM00jS14lDDFcplY/edit#gid=634097696
- Note we need more "detailed" milestones in EACH L3 area to cover all of calendar year 2024
The quarterly reports deadline is Friday; all L3 WBS quarterlies should be in by COB today.
Working on the scrubbing responses, which are due ASAP.
-
13:10 → 13:20
OSG-LHC 10m Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (this week)
- New osg-xrootd + xcache versions
- HTCondor 10.0.6 in EL7 & EL8 release
- HTCondor 10.6.0 in upcoming (EL7, EL8) and release (EL9)
- NO XRootD 5.6.0 or 5.6.1: we caught issues in our integration testing
OSG 23
- OSG 23 will be the next release series
- Looking like a September release
- See slides 10-12 about OSG 23 plans https://agenda.hep.wisc.edu/event/2014/contributions/28481/attachments/9167/11063/2023-07-11.htc23.osg-software-timeline.pdf
-
13:20 → 13:40
WBS 2.3.5 Continuous Operations Convener: Ofer Rind (Brookhaven National Laboratory)
- QR complete
- ANALY_BNL_VP queue maxWorkers doubled to 10000
- Usage has remained below ~500 slots. Multicore scheduling issue?
- Issue with home disks filling up at OU
- CVMFS squid failover issue at SLAC (GGUS) - Wei may have solved it
-
13:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:25
Service Development & Deployment 5m Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
XCaches
- running fine
- some bypass at MWT2 and AGLT2 once traffic goes above 2TB/h
VP
- the next step of the Rucio integration is now in a PR.
- the probability of a dataset having a virtual replica at BNL was increased by a factor of 5; we will need to look at the VP queue CRIC settings to get it to continuously run more jobs
ServiceX
- working fine on AF
- more performance optimizations merged
- running fine on FAB. Getting all ServiceX images to ship with a special gai.conf
Analytics
- all services work fine.
-
13:30
Kubernetes R&D at UTA 5m Speaker: Armen Vartapetian (University of Texas at Arlington (US))
- Understanding the performance of the new cluster and looking into the details when something doesn't look right. Overall, production is running fine.
- Noticed a couple of drops in the production level, but they appear not to be specific to the K8s cluster and look like they were due to the storage servers getting overloaded.
- With the new hardware, noticed that the nodes with more CPU cores (64/72/96) are overcommitting the node CPU. For the previous cluster this was solved by tuning the job CPU-request coefficient sent from Harvester; will have to look into this again and probably readjust it.
- Noticed that K8s was trying to schedule production jobs on the master node. A NoSchedule taint was in place initially but appears to have been lost at some point; it has been reinstated (see the sketch after this list).
- Working on reinstalling Prometheus on a dedicated node, and next setting up job accounting reporting.
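For reference, a minimal sketch of re-applying the taint with the Kubernetes Python client; the node name "k8s-master" is a placeholder (not the actual UTA node), and on older clusters the taint key is node-role.kubernetes.io/master rather than control-plane:

    # Minimal sketch, assuming the "kubernetes" Python client and an admin kubeconfig.
    # Equivalent kubectl command:
    #   kubectl taint nodes k8s-master node-role.kubernetes.io/control-plane=:NoSchedule
    from kubernetes import client, config

    def reapply_noschedule_taint(node_name="k8s-master"):
        config.load_kube_config()          # credentials from ~/.kube/config
        v1 = client.CoreV1Api()
        taint = {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}
        # Patch the node spec with the desired taint list; pods without a
        # matching toleration will no longer be scheduled on this node.
        v1.patch_node(node_name, {"spec": {"taints": [taint]}})

    if __name__ == "__main__":
        reapply_noschedule_taint()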
-
13:40 → 13:45
WBS 2.3.1 Tier1 Center 5mSpeakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:45 → 14:05
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- The last 30 days had good running.
- CPB got most of their FY22 compute online but I leave it to Patrick to describe the status.
- NET2 is pretty close to being online but again I leave it to Eduardo to describe the status.
- Working on info for scrubbing response.
- Also doing the quarterly reporting in parallel.
- Checked that the Tier 2 milestones match what I was aware of.
-
13:45
AGLT2 5m Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen
Incidents:
We had two incidents with dCache. On July 8th, the PostgreSQL partition of the head node was flooded by the billing database, and it took us over 24 hours over the weekend to recover it. We are planning to rebuild an R6525 worker node with larger NVMe cards as the new head node, to host a bigger PostgreSQL partition (6 TB vs. 1 TB).
The second incident was on July 19th: two dCache nodes had all their pools offline, which caused some transfer failures; restarting the pools fixed the issue.
System update:
We updated HTCondor from 9.0.17 to 10.0.5 and took the chance to also apply firmware and kernel updates that required a system reboot. We ran into a token issue because in HTCondor 10 the default value of TRUST_DOMAIN changed (to $(UID_DOMAIN)), and the tokens used for daemon authentication need to be signed with the same TRUST_DOMAIN. Our fix was to set TRUST_DOMAIN back to the old value.
-
13:50
MWT2 5m Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- Hardware issues on storage node at UC. Replaced controller and putting it back into service shortly.
- UIUC preventative maintenance today (7/19). Will make sure nodes all come back online in production once it ends.
- Planning to start setting up the WLCG SOC network monitoring hardware next week. Minor disruptions in the network could occur, but no downtime should be needed.
- Building our first set of EL9 (AlmaLinux 9) worker nodes at IU. Have one in production at UC at the moment and it seems to be OK.
- UIUC compute has mostly come in (waiting on a couple chassis). Looking to install what has arrived by the end of the PM, but may have to wait a little longer.
-
13:55
NET2 5m Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Storage
Load tests
- Manual transfers from lxplus and the CERN FTS service worked
- Transfers from the BNL FTS service revealed some SSL issues regarding the use of SHA1 for signing
- Tested an experimental package made by the OSG team, with no effect
- A second problem, related to the FTS transfer mode (PULL), was revealed when transferring from BNL. It was working fine from CERN because streaming mode was allowed there. webdav.authn.require-client-cert=true was preventing HTTP-TPC from working (see the sketch at the end of this Storage section).
- With FTS transfers working correctly we were able to saturate our network link; both WebDAV and XRootD were tested.
- We are talking with Fabio about publishing our storage endpoints:
webdav.data.net2.mghpcc.org
xrootd.data.net2.mghpcc.org
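For context, an illustrative sketch of an HTTP-TPC pull-mode COPY (all URLs, tokens, and certificate paths below are made up). In pull mode the destination door fetches the file from the source itself, authenticating with a forwarded bearer token rather than a client certificate, which is why requiring client certificates on the destination breaks these transfers:

    # Illustrative only: hypothetical endpoints and credentials.
    import requests

    def tpc_pull_copy(dest_url, source_url, source_token, cert, key):
        headers = {
            "Source": source_url,                                      # tell the destination where to pull from
            "TransferHeaderAuthorization": f"Bearer {source_token}",   # credential forwarded for the source leg
            "Credential": "none",                                      # no gridsite delegation needed
        }
        # FTS authenticates to the destination door with X.509 here; the
        # destination streams progress (perf markers) back in the response body.
        resp = requests.request("COPY", dest_url, headers=headers, cert=(cert, key))
        resp.raise_for_status()
        return resp.text

    # e.g. tpc_pull_copy("https://webdav.data.net2.mghpcc.org/atlas/test.root",
    #                    "https://source.example.org/atlas/test.root",
    #                    "<macaroon-or-token>", "proxy.pem", "proxy.pem")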
Openshift
- Progressing; configuring X.509 credentials for Kubernetes cluster access
- Many problems due to the dual-stack setup (network policy controllers not working, Security Context Constraints not working)
-
14:00
SWT2 5m Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA
- Completed installation of available new machines
- A few machines need repair
- More than 20,500 cores, an additional 450 in repair, and the K8s cluster using 1,000
- Mostly have balanced power and cooling in the data center
- Have deployed one new rack and preparing to deploy another for additional space
- Further work should be invisible as we move machines in groups of one or two at a time to the new rack
- Looking at replacing admin node in cluster
OU
- Completed installation of new machines; now 5300 slots plus opportunistic OSCER nodes
- Ordered 3 more R6525, expected to arrive soon
- Have installed slate01.oscer.ou.edu with RockyLinux 9.2, in the process of configuring it
- OSCER maintenance today, upgrading SLURM from v20 to v23 (or v22 if there are issues with v23)
-
14:05 → 14:10
WBS 2.3.3 HPC Operations 5m Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
- Met with TACC support about possibly moving us to $SCRATCH2 for more IOPS
- Perlmutter at <15% allocation remaining and running well
- Rui has been working on a way for users to run a custom image on NERSC_Perlmutter_GPU
-
14:10 → 14:25
WBS 2.3.4 Analysis Facilities Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:10
-
14:15
Analysis Facilities - SLAC 5m Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
Analysis Facilities - Chicago 5m Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- Downtime coming up next week for various upgrades (firmware, OS, Kubernetes, Rook-Ceph)
- ServiceX on FABRIC
- Slice stability issue (VMs disappear) raised with the FABRIC team; seems to be a known issue for which they will deploy a fix
- Should have found a solution for the IPv6 preference (gai.conf sets the preference; the default config prefers IPv4 over private IPv6). See the sketch below.
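For reference, a minimal sketch of checking the address ordering that gai.conf controls; the hostname is a placeholder, and the assumption is a host with both an IPv4 A record and a private (ULA, fc00::/7) AAAA record:

    # glibc orders getaddrinfo() results per RFC 6724, tunable in /etc/gai.conf.
    # With the default policy table IPv4 (precedence 35) beats ULA IPv6
    # (precedence 3); raising fc00::/7, e.g. "precedence fc00::/7 45", flips
    # that. Note that gai.conf replaces the whole default precedence table as
    # soon as any precedence line is present.
    import socket

    def resolution_order(host, port=443):
        """Print addresses in the order the resolver prefers them (first wins)."""
        for family, _, _, _, sockaddr in socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP):
            print("IPv6" if family == socket.AF_INET6 else "IPv4", sockaddr[0])

    if __name__ == "__main__":
        resolution_order("xcache.example.org")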
-
14:25 → 14:35
AOB 10m