US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00 → 13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
We have an early deadline for quarterly reports because of the review at the end of the month. Reports are due by Friday, January 14, 2022 (a week from this coming Friday). To allow Rob and Shawn to get our WBS 2.3 version completed, we need the level 3 (WBS 2.3.x) reports done by Wednesday, January 12, 2022. Please try to get these completed ASAP.
-
13:10 → 13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Several packages ready for testing (a test-install sketch follows the list):
3.5.53-upcoming and 3.6:
- HTCondor-CE 5.1.3 (various bugfixes, see https://opensciencegrid.atlassian.net/browse/SOFTWARE-4951)
- XRootD 5.4.0 (new features and bug fixes, see https://github.com/xrootd/xrootd/releases/tag/v5.4.0)
3.6 only:
- oidc-agent 4.2.4 (new major version, see https://github.com/indigo-dc/oidc-agent/releases for changes since 3.3.3)
- cvmfs 2.9.0
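For sites that want to exercise these builds before they reach release, a minimal sketch of pulling them from the OSG testing repositories. This assumes the standard osg-release yum repo layout; exact repo names and package sets vary by OSG series, so check the OSG release notes for your series:
    # OSG 3.6 host: pick up HTCondor-CE 5.1.3 / XRootD 5.4.0 from the 3.6 testing repo
    yum update --enablerepo=osg-testing htcondor-ce xrootd
    # OSG 3.5 host tracking upcoming: use the upcoming testing repo instead
    yum update --enablerepo=osg-upcoming-testing htcondor-ce xrootd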
-
13:20 → 13:50
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 30m
-
13:50 → 13:55
-
13:55 → 14:15
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- No big crises during the holiday break.
- All sites had some problems.
- There was a major issue at CERN that messed up the monitoring, but the missing data is now available.

- The quarterly reporting is due early this year. I want your input by the end of the day next Tuesday. I listed four specific items that I want each site to address in its report (for some items, some sites will simply report that they are already in the final configuration for the start of Run 3):
- Updating OSG and Condor versions.
- Updating storage version.
- Updates to the queuing system.
- IPv6
- It seems that XRootD may be making progress toward stable HTTP-TPC transfers.
-
13:55
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
- Updated ELK for the log4j issues and applied other security updates.
- MSU received its SLATE node; it is being installed with Alma Linux (per SLATE team request).
- MSU received a network capture node, which will also run Alma Linux (for its Milan CPU).
- MSU received a VMware storage node.
- Purchase plan: R740xd2 with 18 TB drives and R6525 with AMD 7452 (128 HT/node); final counts to be determined after final quotes. Roughly $500k total, split roughly 50/50 between storage and compute.
- Rebooting the Cisco border switches caused IPv4 issues on various machines at the UM site, leading to CVMFS failures and squid server failover (see the proxy-failover sketch after this list). It took a couple of days of debugging (between UM ITS and Cisco support) to fix the issue.
- Another SLATE squid issue: it did not show traffic on the CERN squid monitoring; the nodes had to be rejoined to the k8s cluster to fix it.
- A patch applied to the Cisco border switches fixed the IPv6 forwarding issues (to the Dell management switches), so we were able to bring back into Condor all the R620 worker nodes whose data connections go through the Dell management switches.
- Merit Network has had another issue on MiLR (the network that connects us to Chicago and East Lansing). This broke our default route and access to and from AGLT2 for non research and education (R&E) networks.
- A typo in a routing-rules change caused IPv6 ping failures to all CERN machines, and many jobs failed with Rucio timeouts. It was fixed the next morning.
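As background for the CVMFS/squid failover mentioned above, a minimal sketch of a typical worker-node proxy configuration (host names are hypothetical examples, not AGLT2's actual values): proxies separated by "|" are load-balanced within a group, and the group after the ";" is used only when the first group fails.
    # /etc/cvmfs/default.local (illustrative values only)
    CVMFS_HTTP_PROXY="http://squid1.example.edu:3128|http://squid2.example.edu:3128;http://backup-squid.example.edu:3128"
    CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch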
-
14:00
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
Upgraded uct2-gk to HTCondor-CE 5.1.2 and Condor 9.0.8 this morning.
One of the UC dCache nodes went offline on December 26th; its pools were brought back up that day.
The second set of dCache transfers finished for the UC server room relocation. The next move is scheduled for January 24th.
New IU and UIUC compute nodes are online. The revised UC order was submitted; still waiting on an estimated shipping date.
Surplus UC servers arrived at IU; Fred is in the process of installing them.
Discussing the upcoming purchase order. Fred is working on benchmarking and quotes.
-
14:05
NET2 5m
Speaker: Prof. Saul Youssef
o We had staging issues over the break and had to limit the total number of jobs by hand.
o Downtime on Tuesday, Jan 11 for:
- Retiring the 3 TB-drive pool (770 TB)
- NFS kernel upgrade
- Preparations for new worker nodes
o Adding 4 DTN nodes to increase the GPFS-to-worker bandwidth.
o About to place orders for a new NESE Ceph rack to add to NESE_DATADISK: 3.8 PB raw and 12 new DTN nodes.
o NESE Tape is working and coming online.
o Pressing Harvard on IPv6.
o Plans for the NET2 expansion with UMass (bare metal cluster, etc.) are nearing finalization.
-
14:10
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- Nothing to report, all running well.
UTA:
- Problems are still occurring with the WebDAV door. We are going to upgrade the XRootD version and set up the existing GridFTP servers to take the transfer load.
- Over the break we had one small downtime while the chilled water for the lab was being worked on. Fortunately cooling was maintained and we were able to come back quickly.
- No big crises during the holiday break.
-
14:15 → 14:20
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
TACC
- The SLATE node has an issue: Kubernetes has broken itself.
- Working with the TACC team to fix this.
- Harvester is broken as well because it was using the SLATE node for its MySQL DB.
- The standard SQLite installation won't work at TACC for some reason; something strange in the environment?
NERSC
- Allocation approved for Perlmutter: we have 500K CPU hours and 11K GPU hours on Perlmutter for one year starting Jan 19th.
- Cori is failing a large number of jobs; logs indicate SLURM is cancelling the jobs after about 30 minutes.
- The SLATE node has an issue: Kubernetes has broken itself.
-
14:20 → 14:35
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
- 14:20
-
14:25
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:30
Analysis Facilities - Chicago 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
Tier3 status update provided by Fengping Hu et al. and sent to Alessandra:
- https://docs.google.com/presentation/d/1RZeeTkCZ8biLEXGGDhKsuN-nOM5Pf6dHcwy0eXPWHMw/edit?usp=sharing
-
14:35 → 14:55
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- The BNL HTCondor-CEs have been upgraded (thanks, Xin!).
- ANALY_BNL_VP queue issues were traced to a deactivated CE, followed by a problematic "flavour" value in CRIC, then a maxWallTime=0 pilot setting. Jobs seem to be running again as of this morning.
- File transfers and staging apparently continued during the BNL tape service downtime on 12/29, 13:00-17:00 UTC (link). Why?
- MWT2 squid service briefly degraded after SLATE reconfiguration and failed restart
-
14:35
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:40
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
XCache
- Working fine. Restarted all SLATE instances to pick up some small changes.
- The AGLT2 nodes needed intervention from Wenjing and Mohammad.
VP
- Working fine.
- BNL_VP is now getting jobs, but the jobs are failing. Xin and Ofer are looking at it; not related to XCache.
Analytics
- ES is running fine. Preparing the next batch of servers for transport.
- Updating all the Logstash instances; there are four running.
- Updating the Alarm & Alert frontend.
ServiceX
- Stress testing.
- Testing for graceful handling of errors.
-
14:45
Kubernetes R&D at UTA 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
14:55 → 15:05
AOB 10m
-