US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (tomorrow)
- HTCondor 9.0.2, BLAHP 2.1.0 (upcoming in OSG 3.5 and 3.6)
- XRootD 5.3.0 (3.5 upcoming)
- VOMS client update to support requesting VOMS proxies from IAM
- XCache 2.0.1 (3.5 upcoming)
Miscellaneous
- OSG Yum repos are down; subscribe to status updates at https://status.opensciencegrid.org/
- Multi-resource downtime page should be available this week or next
-
13:20
→
13:35
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 10m
-
13:20
-
13:35
→
13:40
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
- Currently adding more space from the DATALAKE to the ATLAS DATA-Tape and MC-Tape dCache staging pools. This should reduce the churn we are seeing (i.e., files copied from tape to staging disk and then removed before ATLAS copies them away).
Reminder: HPSS (tape system) downtime from 2-Aug-21 through 7:00 pm on 5-Aug-21
-
13:40
→
14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Some issues in the last couple of weeks:
- MWT2: Upgrades to the main enterprise network switches at UC this Wednesday and last Wednesday. Last week's change caused IPv6 issues.
- AGLT2: IPv6 issues and a full work area caused problems.
- MSU: Moving to the new location today.
- Illinois: Today is the quarterly preventive maintenance period.
- Get your reporting in today!
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
1) MSU site is moving 65 WNs to the new DC, i.e. all of the newer WNs (R620s, R630s, C6420s).
2) UM site is working on the IPv6 issues on the new network. There are two causes; we solved one set of problems by adding static IPv6 ND mappings to the gateway, and are still working on the second set, which comes from R620s whose data cables are connected to the management switches.
3) Job failures: 40% failure on 20 July due to two errors. "Payload metadata does not exist" disappeared on 21 July (AGLT2 has the largest number of failed jobs for this error within US ATLAS, but some other sites see similar errors). The "no local space" error occurred because the home directories for the usatlas users filled up after years of small files piling up; we cleaned the space and set up a cron job to keep it clean.
Details on 2):
More worker nodes are having IPv6 connectivity issues (they cannot reach the gateway). There are two sets of causes. The first is possibly a bug in either the Juniper or the Cisco border switches; the workaround is to add static IPv6 ND mappings to the Juniper gateway (we have added all worker nodes). Hopefully this will be resolved when we can retire the Juniper gateway (moving to Cisco) in August. The second is that the management switches (S3048) have IPv6 issues; ~20 R620s need to connect to the management switches for data connections. We haven't found a solution for that yet, so Condor is retired on all R620 worker nodes for now.
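The home-area cleanup cron mentioned above can be sketched as a small script that removes files not modified within a retention window. This is an illustrative assumption about how such a job could work, not AGLT2's actual script; the path and 7-day threshold are placeholders.

```python
# Hypothetical sketch of a home-directory cleanup job: delete regular
# files under `root` whose modification time is older than the cutoff.
# Run it from cron (e.g. nightly) against the shared usatlas home area.
import os
import time

def clean_old_files(root, max_age_days):
    """Remove files under `root` older than `max_age_days`; return removed paths."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                pass  # file vanished or is not removable; skip it
    return removed
```

A crontab entry such as `0 3 * * * python3 /usr/local/sbin/clean_home.py` (path illustrative) would run it nightly.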
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
UC:
- North border router was upgraded last week (7/14) and the south border router is being upgraded this week (7/21)
- After the north border router was swapped out, routing moved to the south which led to IPv6 issues over the weekend and early this week. UC network engineers worked on a fix, but ultimately moved IPv6 routing through the north border router temporarily.
- GGUS #153052 associated with the IPv6 issue (transfer issues)
- 0% transfer efficiency with NERSC-PDSF
- Relocation equipment trickling in.
IU:
- New management nodes up and running.
- Working on getting new PerfSonar machines set up
UIUC:
- SLATE node arrived. Needs to be built and configured.
- Quarterly PM today (7/21)
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
GGUS tickets: 0
HC blacklists: 0
o Smooth operations
o Site full except for a dip around 2021-07-15 (unknown if it's a widespread dip)
o Advanced stages of getting ready to buy worker nodes.
o xrd 5.3.0 installed and working in our custom container
o Successfully exporting NET2_DATADISK, _SCRATCHDISK, _LOCALGROUPDISK
o Endpoint atlas-xrootd.bu.edu registered in CRIC
o Configured for HTTP-TPC, custom adler32, both work successfully
o Getting put into "smoke tests" by Alessandra & co.
o Some problems remain, possibly related to transfers to dcache sites, Wei and Andy are investigating.
o NESE Tape ATLAS endpoints have arrived, expect to be racked and cabled this week.
o perfSONAR node rebuilt with new hardware; both nodes are IPv6 now.
o Annual MGHPCC power maintenance, August 9
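For context on the "custom adler32" item above: Adler-32 is the checksum ATLAS uses to verify transfers, and a storage endpoint typically computes it incrementally as the file streams through. A minimal sketch of that incremental computation with Python's zlib (illustrative only; NET2's actual implementation lives in their XRootD setup):

```python
# Compute an Adler-32 checksum over data delivered in chunks, the way
# an endpoint would checksum a file during a streaming transfer.
import zlib

def adler32_file_chunks(chunks):
    """Running Adler-32 over an iterable of byte chunks."""
    value = 1  # Adler-32 of the empty string is 1
    for chunk in chunks:
        value = zlib.adler32(chunk, value)  # fold each chunk into the running value
    return value & 0xFFFFFFFF
```

Feeding the data in chunks yields the same value as checksumming the whole file at once, which is what makes per-transfer verification cheap.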
-
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
Setting up a test host, as a proxy, with the test version of XRootD 5.3 from OSG. Software installed; working on configuration.
Operations mostly smooth over the period
OU:
- Smooth operations, ran low on jobs occasionally.
- XRootD 5.3.0 installed, HTTP-TPC working, waiting to be included in smoke tests.
-
14:00
→
14:05
WBS 2.3.3 HPC Operations 5m
Speaker: Lincoln Bryant (University of Chicago (US))
-
14:05
→
14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
Analysis Facilities - Chicago 5m
Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
A few more compute nodes showed up and have been racked, built, etc. and added to the cluster.
A couple more interactive machines showed up, but haven't been racked and built yet. These (along with the three machines mentioned above) aren't necessary for us to go into production.
Still waiting on the GPU machine. We believe sometime in November is when it will arrive (according to Dell).
We've gotten a Condor queue up and running. Jobs can be submitted from both of the submit hosts we're planning to have for users on day 1.
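For readers unfamiliar with the setup above: from either submit host, users would describe a job in an HTCondor submit file and hand it to `condor_submit`. A minimal sketch (file name, paths, and resource request are illustrative assumptions, not the facility's actual template):

```
# hello.sub - minimal illustrative HTCondor submit description
executable   = /bin/echo
arguments    = "hello from the analysis facility"
output       = hello.$(Cluster).out
error        = hello.$(Cluster).err
log          = hello.log
request_cpus = 1
queue
```

Submitting with `condor_submit hello.sub` queues one job; `condor_q` shows its status.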
-
14:20
→
14:40
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Created FedOps team email list and documentation page, waiting on RT queue
- XRootD 5.3.0 release should now be deployed at sites; need to add it into the DOMA smoke tests
- BNL xcache updated and operational again (lost one NVME drive) - will wait for Ilija before activating VP queue
- Mark working with Saul on topology clean-up for NET2
- Working on Quarterly Report
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
The document at https://docs.google.com/document/d/1YDvIHMYihczN9zX-wxyBp2k7pBzS8YoDnA0EPhXspUo/edit?usp=sharing describes the emerging FedOps procedure for Frontier-Squid.
-
14:30
-
14:40
→
14:45
AOB 5m
-