US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
1
WBS 2.3 Facility Management News
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
2
OSG-LHC
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (next week)
- OSG 3.5
- vo-client 115
- python-scitokens 1.6.2
- OSG 3.5-upcoming
- HTCondor 9.0.7 (has GSI, proxy delegation works)
- OSG 3.6
- vo-client 115
- XRootD 5.3.2
- xrootd-multiuser 2.0.3
- XCache 3.0.0
- osg-xrootd 3.6-10
- HTCondor 9.0.7 (no GSI, proxy delegation broken)
- blahp 2.2.0 (no GSI)
- python-scitokens 1.6.2
- OSG 3.6-upcoming
- HTCondor 9.3.0 (no GSI, proxy delegation works)
Miscellaneous
- How's testing of XRootD in 3.6 going?
- HTCondor-CE updates to support tokens
- Known issue with C-style comments outside of routes in JOB_ROUTER_ENTRIES (thanks for the report, Wenjing!): https://opensciencegrid.org/docs/release/notes/#known-issues
- CEs on token-supporting versions of HTCondor-CE
- gate01.aglt2.org
- gate02.grid.umich.edu
- gate04.aglt2.org
- gridgk05.racf.bnl.gov
- osg-gk.mwt2.org
- CEs on old versions of HTCondor-CE
- atlas-ce.bu.edu
- gk01.atlas-swt2.org
- gk04.swt2.uta.edu
- grid1.oscer.ou.edu
- gridgk01.racf.bnl.gov
- gridgk02.racf.bnl.gov
- gridgk03.racf.bnl.gov
- gridgk04.racf.bnl.gov
- gridgk06.racf.bnl.gov
- gridgk08.racf.bnl.gov
- iut2-gk.mwt2.org
- mwt2-gk.campuscluster.illinois.edu
- ouhep0.nhn.ou.edu
- spce01.sdcc.bnl.gov
- tier2-01.ochep.ou.edu
- uct2-gk.mwt2.org
- OSG 3.5
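The known issue with C-style comments in JOB_ROUTER_ENTRIES noted above can be illustrated with a minimal sketch (hypothetical route name and attributes, not any site's actual configuration); the comment placed outside the route's bracketed ad is what trips the parser in the token-supporting HTCondor-CE versions:

```
# JOB_ROUTER_ENTRIES is a list of ClassAd route definitions.
# A C-style comment sitting *between* routes, as below, triggers the
# known parsing issue; the commonly suggested workaround is to move
# the comment inside a route's [ ... ] ad or remove it entirely.

JOB_ROUTER_ENTRIES = \
   /* this comment is outside a route and triggers the bug */ \
   [ \
     GridResource = "condor localhost localhost"; \
     name = "Local_Condor"; \
   ]
```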
-
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
3
TBD
-
4
WBS 2.3.1 Tier1 Center
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running - some site problems....
- AGLT2 was down for a Condor upgrade when the network failed.
- Their new network redundancy scheme is not quite in service.
- MWT2 memcache problem last weekend caused the site to drain.
- Planning continues to expend the funding completely by the end of the grant.
- https://docs.google.com/spreadsheets/d/1-CV5UgeVsDYj8KrVMvuLP0lAVAcjNEQ8TgdYY6911vo
- XRootD still needs to be restarted once in a while.
- AGLT2 updated to HTCondor 9.0.6 and HTCondor-CE 5.1 (osg35) and ran into a weird bug involving ignored comments in a configuration file.
- We continue to bang on removing SRM and getting SRR reporting to be stable.
- As a side effect of this I have noticed that our storage element definitions are inconsistent. I think that Horst and Alessandra got this right at OU and we need to iterate at the other sites.
- There has been an extended discussion on setting various CRIC parameters.
- Ofer and I have been planning how to prioritize the updates that we need to get done between now and the start of Run 3.
- IPv6 is still not implemented at NET2 and CPB.
- Mark Sosebee will retire in January (though he might come back part time).
-
5
AGLT2
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
Hardware
3 R740xd2 servers from MSU are in the production system. IO benchmarks show that a 512K stripe size for RAID6 gives the best IO performance, about 10% better.
Incident:
11/22/2021, from 10am local time, the 10G commodity link connecting the AGLT2 UM site to Merit went down, so all nodes in the aglt2.org domain lost access to the Merit DNS servers. The issue was resolved around 7pm when Merit repaired the hardware connected to this link. During this window all data transfers were failing; the site had already drained to 8% because of a planned Condor update ahead of the network outage.
dCache pool umfs06_12 caused jobs to fail when staging in files; restarting the dCache service resolved the problem.
System update:
Condor was updated from 8.8.15 to 9.0.6, and condor-ce from 4.5.2 to 5.1.2. During this update we switched the authentication from host-based to token-based for the Condor cluster, which went smoothly because we had already practiced it on a testbed. But we hit an issue with condor-ce after the update: the CE could see incoming jobs, but the jobs could not be submitted to the local Condor system. A few hours of debugging traced the cause to the new htcondor-ce not supporting the comment format used in the job router configuration; this has been reported as a bug to the HTCondor development team. At about 13:00 on 11/23/2021 the site started to ramp up with jobs. During the entire period of draining and update problems, BOINC jobs were able to fill all the job slots of the site.
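The host-to-token authentication switch described above can be sketched as a minimal HTCondor configuration fragment (hedged: the knob values, identity, and file locations below are illustrative defaults, not AGLT2's actual settings):

```
# Prefer token (IDTOKENS) authentication, falling back to FS for local users.
SEC_DEFAULT_AUTHENTICATION_METHODS = IDTOKENS, FS
SEC_DEFAULT_AUTHENTICATION = REQUIRED

# Daemons read tokens from /etc/condor/tokens.d/ by default; a token for a
# daemon identity can be minted on the pool's signing key with, e.g.:
#   condor_token_create -identity condor@pool.example.org
```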
-
6
MWT2
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
One of our MD3460 s-nodes went offline again Nov 15, back up later that day. Discussing retirement plans for these nodes (also now out of warranty).
Site issue over last weekend caused the site to drain. Back online Monday.
IU scheduled to update to HTCondor CE 5 / HTCondor 9 November 29. UC will update the first week of January. UIUC will be scheduled after the UC update.
-
7
NET2
Speaker: Prof. Saul Youssef
Source of occasional HC bumping us offline probably found and dealt with.
Problem with the 2 x 100Gb networking between NET2 and NESE Ceph.
Minor post-XRootD bump: nodes rebooting where the container loses the GPFS mount.
Planning for networking re-arrangements, worker nodes.
Working on NESE Tape with NESE and MIT teams.
Todo:
- new perfsonar hardware
- ipv6
- OSG 3.5/3.6 upgrade
-
8
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- Site generally running well
- Still seeing xrootd hangups, have cron in place to restart hung services
- Waiting for next xrootd patch
- Trying to switch SAM/SiteMon over from GridFTP to XRootD for primary SE monitoring; that's currently causing UNKNOWN status, possibly because SiteMon is trying to monitor the internal xrootd door, which isn't possible
- SiteMon/MONIT team is looking at this
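OU's cron-based restart of hung xrootd services could look roughly like the following POSIX shell sketch (hypothetical: the probe and restart commands, service name, and port are assumptions, not OU's actual cron job):

```shell
#!/bin/sh
# watchdog PROBE_CMD RESTART_CMD:
# run PROBE_CMD; if it exits non-zero, run RESTART_CMD and return 1.
# Both arguments are shell command strings, so they can be mocked in tests.
watchdog() {
    if ! sh -c "$1" >/dev/null 2>&1; then
        sh -c "$2"
        return 1
    fi
    return 0
}

# Hypothetical cron usage (service name and port are assumptions):
# */10 * * * * root /usr/local/sbin/xrootd-watchdog
# where the script would call something like:
#   watchdog "timeout 10 xrdfs localhost:1094 query config version" \
#            "systemctl restart xrootd@clustered"
```

The probe uses a lightweight `xrdfs query` so a healthy server answers immediately, while a hung one trips the `timeout` and triggers the restart.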
UTA:
- New storage (~2 PB) is coming online; it will mostly be used to retire existing storage.
- The WebDAV door is performing fairly well, considering the load. Working on converting the existing GridFTP doors to include WebDAV.
- We have IPv6 addresses for the perfSONAR machines and are in the process of setting them up.
- We are investigating an issue where some jobs failed to use the correct FRONTIER_SERVER variable.
-
9
WBS 2.3.3 HPC Operations
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
- TACC
- running fairly well but ran out of jobs. Followed up with DPA and a new dedicated task has been assigned.
- "Texascale" mode coming up - will be offline for a week starting Dec 6 to run only the largest jobs
- NERSC
- Cori scheduled maintenance last week, plus ongoing filesystem instability
- No updates for Perlmutter this week.
-
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
10
Analysis Facilities - BNL
Speaker: Ofer Rind (Brookhaven National Laboratory)
-
11
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
12
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
Working numerous issues:
- Cleaning up SRR reporting and disabling SRM protocol fallback (particularly AGLT2 and maybe BNL)
- Fred and Ofer reviewing WLCG storage availability reporting
- Pointed out change in Kibana monitor auth to ADC (result of shift to OpenStack, requires membership in es-atlas-kibana e-group to view); also found plots now missing on brokerage page
- Need to clarify procedure for adding elements to CRIC/Topology
- F-S DevOps: working with Michal to understand squid failover threshold for ticket alerts
- SLATE squid container update to OSG 3.6?
- HTCondor-CE updates? (Pushed back at BNL until next week)
- xrootd standalone server deployed at BNL for testing; Qiulan will help to configure
- Prioritized readiness list for Run3
-
13
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14
Service Development & Deployment
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
XCaches
- All working fine.
- The AGLT2 networking issue was handled gracefully and automatically.
- Upgraded to 5.3.4
VP
- Working fine
- RAL has still not upgraded to 5.3.4, and most failures are coming from there.
Rucio
- VP integration development continues. Heartbeat endpoint PR now in review.
- Oracle DB change is in and working fine.
ServiceX
- AF-deployed instances are running stably.
- A lot of development and cleanup work.
- Testing FABRIC deployed instance.
-
16
AOB