US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Token transition
- Register for the Oct 14 & 15 workshop! https://opensciencegrid.org/events/Token-Transition-Workshop/
- Update your CEs using packages from 3.5 upcoming
- HTCondor 9.0.5 contains an important change that improves bearer token + GSI support
- There are known bugs with the new job router config transform syntax, and fixes are expected in HTCondor-CE 5.1.2. The old job router config syntax still works fine, though!
- Docs for installing a new CE: https://opensciencegrid.org/docs/compute-element/install-htcondor-ce/
- Docs for updating an existing CE: https://opensciencegrid.org/docs/release/updating-to-osg-35/#updating-to-htcondor-ce-5
- CEs still on versions that don't support token submission:
atlas-ce.bu.edu
bgk01.sdcc.bnl.gov
bgk02.sdcc.bnl.gov
gate01.aglt2.org
gate02.grid.umich.edu
gate03.aglt2.org
gk01.atlas-swt2.org
gk04.swt2.uta.edu
grid1.oscer.ou.edu
gridgk01.racf.bnl.gov
gridgk02.racf.bnl.gov
gridgk03.racf.bnl.gov
gridgk04.racf.bnl.gov
gridgk06.racf.bnl.gov
gridgk07.racf.bnl.gov
gridgk08.racf.bnl.gov
iut2-gk.mwt2.org
mwt2-gk.campuscluster.illinois.edu
ouhep0.nhn.ou.edu
spce01.sdcc.bnl.gov
tier2-01.ochep.ou.edu
uct2-gk.mwt2.org
- FaHui verified token-based submission to the MWT2 CE and is working on making necessary changes to Harvester
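To see at a glance where the outstanding CE upgrades are concentrated, the host list above can be tallied by domain. This is a minimal sketch: the helper names `domain_of` and `tally_by_domain` are hypothetical, and treating the last two hostname labels as the "domain" is an assumption that only approximates site boundaries.

```python
from collections import Counter

# CEs reported above as still lacking token submission support.
CES = [
    "atlas-ce.bu.edu", "bgk01.sdcc.bnl.gov", "bgk02.sdcc.bnl.gov",
    "gate01.aglt2.org", "gate02.grid.umich.edu", "gate03.aglt2.org",
    "gk01.atlas-swt2.org", "gk04.swt2.uta.edu", "grid1.oscer.ou.edu",
    "gridgk01.racf.bnl.gov", "gridgk02.racf.bnl.gov", "gridgk03.racf.bnl.gov",
    "gridgk04.racf.bnl.gov", "gridgk06.racf.bnl.gov", "gridgk07.racf.bnl.gov",
    "gridgk08.racf.bnl.gov", "iut2-gk.mwt2.org",
    "mwt2-gk.campuscluster.illinois.edu", "ouhep0.nhn.ou.edu",
    "spce01.sdcc.bnl.gov", "tier2-01.ochep.ou.edu", "uct2-gk.mwt2.org",
]


def domain_of(host, labels=2):
    """Return the last `labels` dot-separated parts of a hostname."""
    return ".".join(host.split(".")[-labels:])


def tally_by_domain(hosts):
    """Count hosts per domain (assumption: domain = last two labels)."""
    return Counter(domain_of(h) for h in hosts)


if __name__ == "__main__":
    # Print domains with the most outstanding CEs first.
    for domain, count in tally_by_domain(CES).most_common():
        print(f"{domain}: {count}")
```

Entries can be removed from `CES` as sites upgrade, which makes the remaining work easy to track between meetings.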
-
13:20
→
13:35
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 10m
-
13:35
→
13:40
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:40
→
14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- It was a very good running period for the last two weeks. Prod job failures were low:
- AGLT2 1.0%
- MWT2 1.2%
- NET2 4.1%
- SWT2 OU 0.8%
- SWT2 CPB 7.3%
- Could each site report on their order status - I think most sites have their orders in but would like to confirm.
- IPV6 Status at NET2 and SWT2?
- HTTP TPC at NET2 and SWT2
- MWT2 (and BNL) are starting to investigate HTCondor-CE 5.1.1 / Condor 9.x using OSG 3.5 upcoming (in preparation for the move to OSG 3.6)
- Still working on SLATE / Frontier-Squid: OU is setting up a server and CPB is moving OSG jobs to the SLATE Squid.
- Moves at UC and UTA proceeding.
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
09/01/2021
A fiber cut between the University of Michigan and Chicago happened yesterday. Thanks to our network resiliency, connectivity for most networks failed over to an alternate path. However, the LHCONE peering did not fail over correctly, which caused transfer failures (over 15000 jobs in transferring status) and Squid monitoring issues (both SLATE Squid servers appeared to be down in the CERN Squid monitoring while they were actually running). This has just been fixed (Sep 2 2021, around 10:35 AM Eastern time).
09/06/2021
One of the dCache pools (umfs11_6) went offline, which made some files inaccessible (GGUS 153689). The pool was brought back online once the problem was identified.
09/07/2021
Retired rack 117 (became MSU T3): 39 x R610, each with dual E5520 @ 2.27GHz (16 HT cores per node), 624 HT cores total
AGLT2_UM has equipment ordered with Dell but delivery times are from Dec 1, 2021 to January 5, 2022. MSU site order is still waiting for approval.
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
Tested gatekeeper + queue with HTCondor-CE 5.1.1 and Condor 9.0.5. We were running 350 jobs with Fahui's test Harvester instance
Applied mitigation for CVE-2021-3715 to all three sites
Running network tests between the old and new UChicago datacenters before putting transitional storage online and starting datacenter transfers
Brief power outage last weekend in the UChicago datacenter. No critical services were affected, but a number of compute nodes rebooted
POs submitted for all three sites
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
o 1 GGUS ticket, some failures from NESE Ceph endpoints due to overloading.
o P.O. for 88 worker nodes at DELL, + 2 for Tier 3
o 8TB disks failing in out-of-warranty GPFS, evacuating LUN behind the scenes.
o Webdav with Xrootd 5.3.0 passing smoke tests. Ramping up for production. ETA end of this week, NESE endpoints to follow.
o Ramping up work on NESE Tape endpoints. Working with IBM, MIT, NESE operations team.
o Preparing for NET2 expansion including UMass, large new compute resources.
o MIT has purchased equipment for upgrading MGHPCC WAN to 1800 Gb/s total on the MGHPCC-Boston-NYC-Albany loop.
o Met with Mark Sosebee re: spiffing up OIM entries.
-
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA_SWT2:
- Starting the retirement process for the cluster
- Created UTA_SWT2_UCORE_RET panda queue with remote I/O
- HC worked initially but went offline this morning
- Once this is in place, will empty UTA_SWT2_DATADISK
SWT2_CPB:
- Deployed HTTP-TPC service on gk06.atlas-swt2.org
- Passing DTEAM and ATLAS smoke tests, moving to DOMA smoke tests
- In process of deploying ~2PB of storage and new compute nodes
- equipment racked, working on configs / burn in
- Starting to redeploy the K8s cluster for Armen's use; hopefully something will be available by the next meeting.
OU:
- Not much to report, site running well.
-
14:00
→
14:05
WBS 2.3.3 HPC Operations 5m
Speaker: Lincoln Bryant (University of Chicago (US))
-
14:05
→
14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
→
14:40
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Webdav protocol added to SWT2_CPB SE; Alessandra issued PR to add SWT2 to DOMA smoke tests after manual testing was successful.
- NET2 should be in production with HTTP-TPC by end of this week
- Proposal to move ATLAS completely off gsiftp transfers (including tape) by end of the calendar year, with cleanup and removal of gsiftp from all configurations by start of Run 3
- Upcoming downtimes for all ADC services due to Oracle DB upgrade
- Mon. 9/20, 10:00-10:30 CET
- Mon. 9/27, 10:00-18:00 CET
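For US sites planning around these windows, the announced times can be converted to local time. The sketch below rests on two assumptions: the "CET" times are read as CERN local time (`Europe/Zurich`, which observes daylight saving in September), and the year is taken as 2021 to match the dates elsewhere in these notes. If the announcement meant strict CET (UTC+1), the results shift by one hour. The names `DOWNTIMES`, `US_ZONES`, and `to_local` are illustrative only.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: announced times are CERN local time, not strict UTC+1.
CERN = ZoneInfo("Europe/Zurich")

# US time zones of interest (illustrative selection).
US_ZONES = {
    "Eastern": ZoneInfo("America/New_York"),
    "Central": ZoneInfo("America/Chicago"),
}

# (start, end) of each announced downtime window, in CERN local time.
DOWNTIMES = [
    (datetime(2021, 9, 20, 10, 0, tzinfo=CERN),
     datetime(2021, 9, 20, 10, 30, tzinfo=CERN)),
    (datetime(2021, 9, 27, 10, 0, tzinfo=CERN),
     datetime(2021, 9, 27, 18, 0, tzinfo=CERN)),
]


def to_local(window, zone):
    """Convert a (start, end) window to the given time zone."""
    start, end = window
    return start.astimezone(zone), end.astimezone(zone)


if __name__ == "__main__":
    for window in DOWNTIMES:
        for name, zone in US_ZONES.items():
            start, end = to_local(window, zone)
            print(f"{name}: {start:%a %m/%d %H:%M} - {end:%H:%M}")
```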
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
XCache
- issue with TLS now used by two sites. Solved.
- All the SLATE xcaches updated to 5.3.1 and redeployed.
- Otherwise working smoothly
VP
- AGLT2 VP is again getting jobs
- all working fine
ES
- Soon will start migration to new computing center.
- Will be done in two steps.
- Should not need any downtime
Rucio
- A bug was found in how client location was cached. It is fixed and will be in testing on Sep 27th, and in production a week later.
- Will need to switch to using latitude and longitude for sites as found in CRIC
-
14:40
→
14:45
AOB 5m
-
13:00
→
13:10