US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- OSG 3.5 enters critical/bug fix only support at the end of this month: https://opensciencegrid.org/technology/policy/release-series/#life-cycle-dates.
- Now is the time to start updating HTCondor-CEs to 3.5 upcoming to get ready to accept tokens
- AGLT2
  gate01.aglt2.org
  gate02.grid.umich.edu
  gate03.aglt2.org
- BNL
  gridgk01.racf.bnl.gov
  gridgk02.racf.bnl.gov
  gridgk03.racf.bnl.gov
  gridgk04.racf.bnl.gov
  gridgk06.racf.bnl.gov
  gridgk07.racf.bnl.gov
  gridgk08.racf.bnl.gov
- BU
  atlas-ce.bu.edu
- MWT2
  iut2-gk.mwt2.org
  uct2-gk.mwt2.org
- SWT2
  gk01.atlas-swt2.org
  tier2-01.ochep.ou.edu
- OSG will host another token hackathon this month – admins are free to come with questions, for help updating their CEs, etc.
- Pre-GDB token workshop in October (11-12?)
-
13:20
→
13:35
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 10m
-
13:20
-
13:35
→
13:40
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
Writing to tape from ATLAS has been paused so that we can reduce the number of tapes used for writing. We have reduced the number of tape drives used for writing to 4 (each capable of writing 200 MB/s), so we can write 67 TB a day to tape, and we have almost 400 TB to write from the internal HPSS disk cache to tape. We cannot reduce the number of drives any further.
We are having a problem staging files from tape to the dCache disks. dCache is not pulling files from the HPSS cache and we are seeing a lot of bad dCache restore requests. We are investigating and actively trying to clear up the situation so that data can flow.
This problem was triggered by a large request (>200k files) on Saturday mid-day. The exact source of the request is under investigation.
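As a rough cross-check of the tape numbers above, the sketch below computes the peak daily write rate for 4 drives at 200 MB/s and the time needed to drain a 400 TB backlog. It assumes decimal units (1 TB = 1e6 MB) and drives writing continuously; the function names are illustrative only. The quoted 67 TB/day sits slightly below the computed peak, which would be consistent with some overhead, though the report does not say so.

```python
# Back-of-the-envelope check of the tape write figures in the Tier1 report.
# Assumptions (not stated in the report): decimal units, drives busy 24 h/day.

SECONDS_PER_DAY = 24 * 60 * 60

def peak_tb_per_day(drives, mb_per_s_per_drive):
    """Peak aggregate write rate in TB/day for the given number of drives."""
    return drives * mb_per_s_per_drive * SECONDS_PER_DAY / 1e6

def days_to_drain(backlog_tb, tb_per_day):
    """Days needed to write a backlog at a constant daily rate."""
    return backlog_tb / tb_per_day

peak = peak_tb_per_day(drives=4, mb_per_s_per_drive=200)
print(f"peak rate with 4 drives: {peak:.2f} TB/day")
print(f"days to write 400 TB at the quoted 67 TB/day: {days_to_drain(400, 67):.1f}")
print(f"days to write 400 TB at the peak rate: {days_to_drain(400, peak):.1f}")
```

Under these assumptions the backlog takes roughly six days to clear at either rate.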
-
13:40
→
14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Good running with scheduled downtimes at AGLT2, MWT2, and NET2.
- A few mysterious periods of draining that I will ask the relevant sites about.
- Nearing (I think nearing!) getting final quotes for the FY21 purchases.
- Dell will raise the prices on Sep 1 and this means we need POs by Aug 31.
- Prices are up - way up.
- Almost no choice in CPUs - most are back ordered to ~January 2022. For compute servers the only viable AMD processor is the EPYC 7302 (16C/32T @ 3.0 GHz), which forces either 2 GB or 4 GB per thread. (Last year we used the EPYC 7402 (24C/48T @ 2.8 GHz), which yielded 2.7 GB per thread.)
- I need to know how many compute servers each site wants and how much money each site has to spend on compute servers.
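The memory-per-thread figures quoted for the two EPYC models can be reproduced with a short calculation. This is only a sketch: the dual-socket layout and the 128 GB / 256 GB memory configurations are assumptions chosen because they match the quoted 2 GB, 4 GB, and 2.7 GB values; the notes do not state the actual server configurations.

```python
# Memory per hardware thread for a compute server.
# Assumed (not from the notes): dual-socket servers with 128 GB or 256 GB RAM.

def gb_per_thread(total_ram_gb, sockets, threads_per_socket):
    """RAM available per hardware thread, in GB."""
    return total_ram_gb / (sockets * threads_per_socket)

# EPYC 7302: 16C/32T per socket
print(gb_per_thread(128, sockets=2, threads_per_socket=32))  # 2 GB option
print(gb_per_thread(256, sockets=2, threads_per_socket=32))  # 4 GB option

# EPYC 7402: 24C/48T per socket
print(round(gb_per_thread(256, sockets=2, threads_per_socket=48), 1))
```

With 64 threads per server, standard memory sizes land on exactly 2 GB or 4 GB per thread, whereas 96 threads allowed an intermediate value.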
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
MSU
- last of 4 waves done 11-Aug-2021, mostly MSU T3, thus finishing equipment move from dept bldg to data center.
- purchase planning: 3x ESXi hosts + 1x NVMe storage + ~3 dcache storage nodes + as many 1U WNs as budget allows.
- FYI@MWT2: we may be decommissioning the EX9208 this week, to be confirmed.
UM:
- draining gate03 for the condor-ce update (we lost 1000 jobs when we updated gate01 with jobs still running, so to be cautious we now drain the gatekeeper first)
- patched a bunch of nodes with ipv6 issues (adding ipv6 neigh rules manually).
- did 2 condor updates (8.4.13->8.4.14->8.4.15)
- rebuilt all Tier2 WN with CentOS7
- finished rebooting all nodes to the new 1160.36 kernel
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
MGHPCC scheduled maintenance.
NESE_DATADISK was down for an additional day for Harvard re-networking.
xrootd is working now, passing smoke tests, HTTP-TPC.
High priority items:
1) Prepare for worker node purchase
2) Expand xrd cluster, switch over from gridftp to xrootd
3) OSG 3.5 update
4) ipv6 finish
-
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- Nothing to report, running well.
- Upgraded xrootd proxy (se1) to 5.3.1, seems to run well.
UTA:
- Preparing to install the compute nodes + storage from our last purchase. Logistically this will allow us to move forward with the move / retirement of UTA_SWT2.
- About to install the latest version of XrootD on the HTTP-TPC test instance. Need to verify the ROCKS recipe for building the host as a final step prior to production deployment.
- Need to schedule a downtime to install the LAN networking upgrade hardware. Many needed software updates will occur during this outage.
- Recent operations generally stable, smooth.
-
14:00
→
14:05
WBS 2.3.3 HPC Operations 5m
Speaker: Lincoln Bryant (University of Chicago (US))
-
14:05
→
14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
- 14:05
- 14:10
- 14:15
-
14:20
→
14:40
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- ES upgraded yesterday to 7.14. Changing its monitoring and preparing it for the big upgrade (8.0). Manually indexing missed data.
- XCache - all up and running. Now SLATE xcaches auto-configured through GitOps.
- VP - all working fine. Working on Rucio integrations. Rucio Oracle upgrade only in September.
- Squids - all working fine. Preparing things for retirement of "old" squids.
- Rucio GeoIP still not tested due to unrelated changes that make it python 2.7 incompatible. Now version 1.26.2
- 14:30
-
14:20
-
14:40
→
14:45
AOB 5m