US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
1
WBS 2.3 Facility Management News
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
2
OSG-LHC
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (this week)
- HTCondor 9.3.0 in OSG 3.6 upcoming
- osg-ca-certs-updater for EL8
Other
- Update your CEs to HTCondor-CE 5 and HTCondor 9 from the OSG 3.5 upcoming repository!
- FaHui is updating central Harvester infrastructure to support token-based pilots
- For pilots, we expect to be able to consolidate token -> user mappings to a single user!
- Working with the HTCondor team to figure out a solution for mapping SAM/ETF tests to a separate user (scope based mappings?)
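As a sketch of the consolidation being discussed, an HTCondor-CE SCITOKENS mapfile entry can map every pilot token from a single issuer to one local user; the issuer URL, file path, and local user name below are illustrative assumptions, not the actual production configuration, and the SAM/ETF scope-based split is still an open question as noted above:

```
# Illustrative sketch: /etc/condor-ce/mapfiles.d/10-atlas.conf
# Maps any token from the (assumed) ATLAS IAM issuer to one local user,
# regardless of the token subject -- the "single user" consolidation.
SCITOKENS /^https:\/\/atlas-auth\.web\.cern\.ch\/,/ atlaspilot
```

How SAM/ETF test tokens would be routed to a separate account (e.g. by matching on token scopes) is exactly the part still being worked out with the HTCondor team.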
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
3
Removing SRM from TPC for US Sites
A quick presentation about removing the SRMv2 protocol for third-party copy (TPC) at our sites.
Speaker: Shawn Mc Kee (University of Michigan (US))
4
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Relatively rough period over the last two weeks:
- Several problems on the CERN side
- Network outage on 10/15 (Friday)
- Draining problem over the weekend (10/17-10/18)
- Yesterday someone at CERN removed an "unnecessary" DB link.
- Other decreases were due to site maintenance and various site networking problems.
- The large increase in job slots for AGLT2 comes from redefining a BOINC job as 8 slots rather than 1 job.
5
AGLT2
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
Overall smooth 2-week operation (until today)
- New problem: 60% of pilots failing, but the other 40% run their payloads normally (2.4% failure rate)
- Being asked to check the condor config so that jobs that fail are not retried locally
- Also 2 events in the BOINC queue, presumably unrelated
- CERN operational error with DBRelease yesterday seems to have caused job failures
- Monitoring: job count has jumped from ~2k to ~12k. Presumed to be proper accounting of multicore payloads.
Storage purchase
- UM received 5x R740xd2 last week: installed and in production, adding 1180 TB usable
- Retired 2x older storage nodes (678 TB) with 4x MD3xxx shelves (including umfs11, which had the recent hardware problems)
- MSU received 3x R740xd2 this week: racked, soon into production, adding 708 TB usable
- Net +1210 TB
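The net figure follows from the capacities reported above; a quick check (all numbers taken from the report, in TB usable):

```python
# AGLT2 storage capacity change, numbers from the report above (TB usable)
um_added = 1180    # 5x R740xd2 at UM, in production
retired = 678      # 2x older storage nodes with 4x MD3xxx shelves
msu_added = 708    # 3x R740xd2 at MSU, soon into production

net_change = um_added - retired + msu_added
print(net_change)  # 1210
```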
6
MWT2
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
7
NET2
Speaker: Prof. Saul Youssef
8
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- Overall, site running well.
- xrootd hung up on the proxy (se1.oscer.ou.edu) again last night; restarted. Andy and Wei are looking into it.
- Problem exacerbated by the rucio copytool (site mover) still not handling write_lan correctly, meaning all stage-outs are still routed through se1 instead of going directly to the local redirector. That really needs to get fixed soon, since it causes huge inefficiencies.
SWT2_CPB:
- XRootD on the WebDAV host is more stable following the update to the curl libraries
- Installed more storage from the most recent purchase (Summer)
- Set up a new perfSONAR bandwidth host. Both machines are now on new hardware.
UTA_SWT2:
- Retirement progressing
- DDM ops re-started the cleanup of the SE. As of this morning, ~10 TB of data known to Rucio remain.
9
WBS 2.3.3 HPC Operations
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
TACC
- Needed to update the local copy of ALRB etc. to get the VO client update at TACC (same issue that affected the Tier 2s over the weekend)
- System maintenance yesterday, some job failures because of it.
- Priority is looking better overall, jobs are generally going through.
- 32% of allocation remaining
NERSC
- Filesystem degraded on Monday causing job failures.
- 60% of the 20M additional node-hour allocation remaining.
- Another 10-20M hours possibly coming. Will need to use them or lose them before the end of the year.
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
10
Analysis Facilities - BNL
Speaker: Ofer Rind (Brookhaven National Laboratory)
11
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
12
Analysis Facilities - Chicago
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
13
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
14
Service Development & Deployment
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
XCaches
- all running fine.
- 5.3.2 is out and I am testing it. Will update everything this week.
- I am adding a container that will send rucio heartbeats.
VP
- all running fine
- Only issue is with RAL; should be solved by the 5.3.2 update.
Rucio
- work on integrating VP
- JSON database for placements (PR is ready)
- adding heartbeats (almost ready)
- work on the placement engine not started yet
ServiceX
- upgraded uproot xrootd plugin version
- much better performance with the big fast xcache.
- will try deploying in FABRIC.
16
AOB