US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
There is a WLCG Kubernetes (k8s) meeting being organized for June 7: https://indico.cern.ch/event/1096043/
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
3.6 release planned for tomorrow, contains:
- osg-configure 4.1.1, with a fix for a gratia probe check on sites with an HTCondor batch system (reported by Xin)
Pending testing, we may also release osg-scitokens-mapfile 8 (for 3.5 and 3.6).
Other software ready for testing:
- OSG 3.6:
- CVMFS 2.9.2 (bugfix release, see https://cvmfs.readthedocs.io/en/2.9/cpt-releasenotes.html)
- voms clients 2.0.15-1.5 (el7) and 2.1.0-0.14.rc2.5 (el8): increase default proxy key size to 2048 bits (smaller proxies are rejected by el8 servers)
- OSG 3.5-upcoming and 3.6:
- xcache (including atlas-xcache) 2.2.0 (OSG 3.5-upcoming) / 3.0.1 (OSG 3.6):
- Fixed xcache-reporter and xcache-consistency-check library issues (causing them to fail to run)
Working on improving documentation for updating to OSG 3.6, especially for xrootd / xcache services.
-
13:20
→
13:50
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 30m
-
13:50
→
13:55
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:55
→
14:15
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running over the past two weeks with some issues.
- AGLT2 had two incidents where the site drained
- MWT2 had two incidents: DATADISK full & AC problems
- NET2 had lots of little incidents and is nearly fully drained today
- SWT2 had a couple of minor incidents.
- Spent some time looking at how quickly sites are put online by HammerCloud, with an extended discussion with Rod. In the end it was not clear why it took 3 hours to put MWT2 online after one of the draining issues, but Rod is pushing the HammerCloud team to provide better displays so a site can see what the hold-up is.
- Also spent some time looking at why a site can be delayed in refilling after being set online. Rod said we need to contact the team while it is happening so they can debug it. Here is a plot of the number of job slots filled vs. time (UTC), showing the site draining and refilling during the MWT2 AC incident (plot covers 24 hours):

- We have about 2 weeks to get OSG 3.6 in place. Let's do it!
- Let's get IPV6 going at NET2 and SWT2.
- UC finished the physical moving of gear to the new machine room but various mopping up continues to get things fully back online.
- UTA_SWT2 has finally shuffled off this mortal coil never to return.
- Could Mark/Patrick please get it disabled/removed from OSG/CRIC.
- GET YOUR QUARTERLY REPORTING IN TODAY!
-
13:55
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
All gatekeepers are updated to OSG 3.6 (HTCondorCEVersion: 5.1.3).
Working on a potential issue with the gratia-probe from the new-style built-in HTCondor-CE probe.
All worker nodes are updated to HTCondor 9.0.11, but gate01/04/aglbatch are still on 9.0.10.
UM is still waiting on worker nodes from the Fall 2021 order (R6525 AMD Rome).
UM and MSU are waiting on the January 2022 order (R6525 AMD Milan), with some/all now delayed to June 9 (all storage has been received and installed).
6-Apr: noticed a stage-out problem; traced to one dCache door (restarted all doors).
-
14:00
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
Tested OSG 3.6 a while back and are currently running it on an opportunistic queue. Will update the site in the coming weeks.
Rolling out kernel updates to fix security vulnerabilities.
UC:
- New Purchases
- New compute racked and cabled. Working on building.
- New storage is in the same situation, though we are sorting out some networking before bringing it online.
- Temporary networking solution in place for compute until purchased production equipment arrives.
- A cooling issue brought down a couple of storage nodes last week. They've been brought back up and have been running fine since.
- Working on a hardware gatekeeper, as we're currently only running off the IU gatekeeper.
- Discussed upgrading dCache to 7.2.x. No concrete date yet.
- Final move phase finished. No more major moves for ATLAS equipment.
IU:
- New compute is online.
- Looking to purchase a few more with remaining funds.
UIUC:
- New compute is online.
- PDU issue in the data center was causing benchmarking to trip power, but has been fixed and benchmarks finish without trouble.
- New Purchases
-
14:05
NET2 5m
Speaker: Prof. Saul Youssef
Operational issues:
An MGHPCC site-level cooling issue caused NESE Ceph to be offline for about 1 hour.
GPFS disk errors in the system pool required evacuation; reduced SGE while this was in process.
88 worker nodes installed and tested; all but 1 rack in production.
Rack of contributed nodes arrived.
Hard to get 100Gb optics from Dell.
New NESE Ceph storage equipment has arrived except for the Cisco switches, expected in August following long delays.
NESE Tape commissioning: the 50PB pool is being reformatted today due to an IBM firmware issue.
Working with the NESE team on NESE Tape expansion.
Quarterly reports are in. Hardware spreadsheet updated.
Run 3 software prep: working on perfSONAR & IPv6. OSG 3.6 to follow.
-
14:10
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
UTA_SWT2 is officially shut down and the equipment has been moved back to campus; work is ongoing to move compute nodes into the SWT2_CPB and K8s clusters.
Existing CE (gk01.atlas-swt2.org) will move to new hardware and OSG 3.6. The last existing job should drain today.
Investigating GRACC reporting issue related to two gatekeepers.
OU:
New GK and SLATE squid are about to be installed to be ready for testing.
Found a workaround for xrootd5 transfer failures seen with the new emi/rucio ALRB tests. xrootd5 client operations from compute nodes against the newly upgraded xrootd5 backend storage insist on using TLS; since the OSG CE propagates X509_USER_CERT and X509_USER_KEY into the wn-client environment, and the hostcert/key files don't exist there, the TLS handshake fails. Unsetting X509_USER_CERT and X509_USER_KEY in atlas_app/local/setup.sh.local prevents these failures.
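The fix amounts to a two-line addition; a minimal sketch, assuming the atlas_app/local/setup.sh.local hook quoted above is sourced in the worker-node job environment:

```shell
# atlas_app/local/setup.sh.local (sketch of the workaround above):
# the CE-propagated variables point at host cert/key files that do not
# exist on the worker nodes, which breaks the xrootd5 TLS handshake,
# so drop them and let the client fall back to the normal proxy.
unset X509_USER_CERT
unset X509_USER_KEY
```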
-
14:15
→
14:20
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
-
14:20
→
14:35
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
- GPU hosts are online; each host will be partitioned into 2x3 (20GB) MIG GPU instances. Xin is working on enabling HTCondor scheduling (the GPUs need to be published for use with partitionable slots)
- Image attached with the MIG options Doug found
- Snowmass CompF4 presentations on AFs last Friday
- News about review and UC onboarding event tomorrow
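A minimal sketch of the HTCondor side of the GPU publishing mentioned above, assuming standard HTCondor 9.x knobs (this is illustrative, not the actual BNL configuration):

```
# condor_config.local sketch (hypothetical): enable GPU discovery so
# the startd advertises the (MIG) GPU instances, and make the slot
# partitionable so jobs can claim them via request_GPUs.
use feature : GPUs
SLOT_TYPE_1               = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1          = 1
```

With MIG enabled, each compute instance is discovered as a separate GPU device, so a partitionable slot on a 2-GPU host would advertise six assignable GPUs in the 2x3 layout described above.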
-
14:25
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:35
→
14:55
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- SBU and SMU storage endpoints decommissioned by DDM
- Thanks to Doug's Spring Cleanup, ten Tier-3 sites deleted (or disabled) in CRIC (ANLASC, BELLARMINE, Brandeis, Hampton, OLCF, Penn, SBU, SMU_HPC, Tufts, UPitt)
- UTA_SWT2 deactivated in OSG Topology (squid removed from monitoring)
- Follow up actions needed in CRIC?
- OSG 3.6 HTCondor-CE deployed and being tested at BNL on gridgk05
- Bug found and fixed in osg-configure (see https://opensciencegrid.atlassian.net/browse/SOFTWARE-5115)
- Harvester jobs are running successfully; will proceed with updates on the remaining gatekeepers once the updated osg-configure rpm is released in production
- Petr noted a bug in the XRootd 5 client with new gfal2, related to missing host cert files that are not needed; affects our XRootd/Slurm sites; there is a workaround and Horst filed a ticket with OSG
- Large HC exclusion yesterday related to a massive spike in held jobs; HC was disabled, but is back now
- Shifter monitor and/or alarm?
- Need QR info today please
-
14:35
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:40
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
14:45
Kubernetes R&D at UTA 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
The startup K8s cluster works fine. I created a queue for it in CRIC, SWT2_CPB_K8S, which was later used for tests.
After that I created a Harvester service account in the K8s cluster and generated a kubeconfig file, used on the Harvester side to communicate with the cluster. Patrick did the firewall reconfiguration, but there was still a communication issue at first. This was tracked down to the initial setup of the cluster: the kubeadm init step by default picked up only the private IP address of the control plane. I regenerated the API server certificate to include the public IP address as well, and did all the related reconfiguration, after which communication with the cluster was established.
After that Fernando managed to submit several grid test jobs; they reached the workers but got stuck there in a waiting state, so we are looking into that right now.
On the hardware side, our admins will start adding nodes to the K8s cluster from the UTA_SWT2 equipment, which arrived last week, and we'll probably also update some of the existing worker nodes.
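The certificate regeneration described above can be sketched with kubeadm's cert phase; this is a sketch under assumptions (kubeadm phase syntax as in recent 1.x releases, with PUBLIC_IP standing in for the control plane's public address, which is not given in the minutes):

```shell
# Sketch: re-issue the kube-apiserver serving certificate with an extra
# SAN for the public IP (PUBLIC_IP is a placeholder, not from the minutes).
sudo mv /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/apiserver.key /root/pki-backup/
sudo kubeadm init phase certs apiserver --apiserver-cert-extra-sans "$PUBLIC_IP"
# After restarting the kube-apiserver static pod, verify the SAN list:
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'
```

The kubeconfig handed to Harvester must then point at the public address so the client validates against the new SAN.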
-
14:55
→
15:05
AOB 10m
-
13:00
→
13:10