US ATLAS Computing Facility
WBS 2.3 Facility Management News
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
OSG-LHC
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
Topical Report: Data Carousel
Data Carousel Update
Convener: Xin Zhao (Brookhaven National Laboratory (US))
WBS 2.3.1 Tier1 Center
Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
- dCache update to 5.2 foreseen after reprocessing.
- MAS progressing well, with some delays due to lack of effort.
- ~1 PB of unused data moved to intermediate storage; data moved to MAS will be deleted from DATADISK.
- BNL_LAKE_UCORE PQ now running production jobs; usage will be monitored.
- Presentation at the WLCG QoS workshop this Friday.
- Reprocessing had some operational issues (interference with data consolidation, T2 issues affecting T1 stage-in, etc.). The whole chain may need to be revisited.
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Sign up for the OSG All Hands Meeting (AHM): https://opensciencegrid.org/all-hands/2020/
- At Shawn McKee's request, I asked each Tier 2 to look over the description of their site in the management document. Please do this.
- I have been pinging sites about mysteries that I found in the 2019 LCG accounting numbers
- High-memory jobs caused very low efficiency in December at BNL
- Still need to work out how to account for non-ATLAS production jobs (BOINC, OSG, etc.)
AGLT2
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
Personnel update/correction:
- Wenjing came back from China before the full travel ban took effect.
Now working from home during self-quarantine period.
Tickets:
- old / now solved ticket 144783 12-Jan-2020 AGLT2: lost heartbeat
- new / assigned ticket 144982 28-Jan-2020 AGLT2: lost heartbeat.
Found and retired one worker node that was failing all jobs with what looked like a file system problem; this is probably unrelated to the lost heartbeats.
No other acute problem found.
Still suspect that most of these errors came from the global pilot problem active around that time.
Currently only 5% of failures come from lost heartbeat.
https://bigpanda.cern.ch/errors/?computingsite=AGLT2_UCORE&jobstatus=failed
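The error-page link above is an ordinary query URL, so the same view can be generated for another queue or job status. A minimal sketch, assuming only the two query parameters visible in that link (`computingsite` and `jobstatus`); the helper name `errors_url` is hypothetical and not part of any BigPanDA tooling:

```python
from urllib.parse import urlencode

# Base address taken from the link quoted in the notes.
BIGPANDA_ERRORS = "https://bigpanda.cern.ch/errors/"


def errors_url(computing_site: str, job_status: str = "failed") -> str:
    """Build an errors-page URL for one computing site (hypothetical helper)."""
    # Only the two parameters seen in the notes are used; others are not assumed.
    query = urlencode({"computingsite": computing_site, "jobstatus": job_status})
    return f"{BIGPANDA_ERRORS}?{query}"


if __name__ == "__main__":
    print(errors_url("AGLT2_UCORE"))
```

Calling `errors_url("AGLT2_UCORE")` reproduces the link given in the notes.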
Hardware:
- Last R740XD2 online and in production for dCache.
Finishing migration and retirement of oldest dCache disk shelves at MSU.
Services:
- xrootd.aglt2.org certificate SANs restored.
MWT2
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- GGUS Ticket 144840 “Auth failed” on xroot downloads
- Very low failure rate on download tests (less than 1%) and small failure rate on jobs.
- Caused by a bug in the installed xrootd version; fixed in a newer version (need to update).
- GGUS Ticket 145103
- Jobs failing with stage-out (permission denied) error.
- The SRM access log shows files failing srmLS, then being placed on disk and succeeding at srmLS afterwards.
- We think it is a similar issue to ticket 144840. Need to update dCache and xrootd and keep an eye on it afterwards.
UC:
Extra network equipment connected. The new storage nodes that were waiting to be added have now been added.
IU:
Running smoothly
UIUC:
Running smoothly
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
1) Generally smooth operations over the past two weeks
2) Replacement of the UPS batteries is almost done (will finish this evening).
3) Investigating accounting irregularities.
OU:
- Not much to report, smooth operations
- Issues with xrootd file system under-reporting space '?oss.cgroup=ATLASDATADISK', investigating.
- Also, xrootd daemons still not rotating logs correctly, need help with that
WBS 2.3.3 HPC Operations
Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
WBS 2.3.5 Continuous Operations
Convener: Robert William Gardner Jr (University of Chicago (US))
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
Analytics Infrastructure & User Support
Speaker: Ilija Vukotic (University of Chicago (US))
- Four data nodes added to ES; one old data node removed.
- Some slowness issues reported by CERN users; investigating.
- A new Dash-based platform for analytics on perfSONAR data. I wrote all the boilerplate code; Petya should customize it.
- Some issues with rucio-events data; some changes to jobs_archive data.
- All collection services and platforms running fine.
Intelligent Data Delivery R&D (co-w/ WBS 2.4.x)
Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
AOB