US ATLAS Computing Facility
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
US ATLAS Computing Facility Capacity Spreadsheet: https://bit.ly/usatlas-capacity
Through March 2020 (FY20Q2):
- V52: CPU capacity increments & retirements
- WLCG-v52: Pledge figures from REBUS available (fill in as needed)
- WLCG-v52, Table 1: Installed storage capacity
- WLCG-v52, Table 2: FY20 Procurement plans
- WLCG-v52, Table 3: Retirements
- WLCG-v52, Table 4: AUX equipment (non-CPU, non-disk)
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
-
13:20
→
13:40
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
OSG-LHC Technical Roadmap 20m
Speaker: Brian Paul Bockelman (University of Nebraska Lincoln (US))
-
13:40
→
14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
For all the sites that see a small percentage of jobs fail with timeouts on input/output:
We are investigating the interaction between the Rucio mover, gfal2, and XRootD. In a number of cases the actual transfer was not even attempted, and the reason seems to be the way the Rucio mover tries to stat the file and get its checksum. Hopefully a fix will come soon; once it is ready we will try to get it expressly tested and deployed. This does not exclude the possibility that there are other issues lurking there.
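For reference, a minimal sketch of the stat/checksum probe in question, using the gfal2 Python bindings (the replica URL is a placeholder; this illustrates the failing step, not the actual Rucio mover code):

    import gfal2

    # Hypothetical source replica; a real URL would point at a site SE.
    SRC = "root://se.example.org:1094//atlas/rucio/mc16/file.root"

    ctx = gfal2.creat_context()  # note: gfal2 spells it "creat_context"
    try:
        # The mover checks the file before copying: size via stat(),
        # then a checksum. If either step fails or stalls, the actual
        # transfer is never attempted, matching the symptom above.
        st = ctx.stat(SRC)
        print("size:", st.st_size)
        print("adler32:", ctx.checksum(SRC, "adler32"))
    except gfal2.GError as err:
        print("probe failed before any transfer was attempted:", err)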
Fred:
It was an OK week for production.
- There were a number of tasks that had high failure rates, but these failures came from the submission side.
- Most recently, in the last day, looping event-generation jobs were killed as a group.
- I was going to mention the Rucio transfer issue but Ilija beat me to it by providing the notes above.
- There was also an unintended Rucio release, which caused trouble for about a day.
- Several sites had short-term issues.
- COVID-19 jobs seemed to run OK but of course reduced ATLAS production.
- NET2 had some stage-out issues with the COVID-19 jobs.
- Looks like recovering just over a month (Feb 28 to Apr 8) of accounting data for CPB will be hard. Right now CPB is not reporting anything to the official GRACC/APEL system for the entire month of March (one way to spot the gap is sketched after this list).
- Port scanning from LHCONE?
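One way to see the CPB accounting gap is to pull a per-day core-hours summary from GRACC. A rough sketch; the Elasticsearch endpoint, index, and field names here are all assumptions that would need checking against the real GRACC schema:

    import requests

    GRACC = "https://gracc.opensciencegrid.org/q"  # assumed endpoint
    INDEX = "gracc.osg.summary"                    # assumed index name

    query = {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"SiteName": "SWT2_CPB"}},    # assumed field name
            {"range": {"EndTime": {"gte": "2020-02-28", "lte": "2020-04-08"}}},
        ]}},
        "aggs": {"per_day": {
            "date_histogram": {"field": "EndTime", "calendar_interval": "day"},
            "aggs": {"core_hours": {"sum": {"field": "CoreHours"}}},
        }},
    }

    r = requests.post(GRACC + "/" + INDEX + "/_search", json=query, timeout=30)
    r.raise_for_status()
    for b in r.json()["aggregations"]["per_day"]["buckets"]:
        # Days that sum to zero are the gap that needs backfilling.
        print(b["key_as_string"], b["core_hours"]["value"])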
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
Incidents:
On April 21st, one of our new R740xd2 dCache servers died (the daughterboard was burnt); Dell sent an onsite technician and we got it replaced within 48 hours. Before that, we submitted a JIRA ticket to declare the unavailability of the files.
Services:
We still see jobs get killed due to OOM, about 200 jobs per two weeks. This mostly happens on worker nodes with less than 2 GB/core; we are in the process of 1) adding more memory to worker nodes using retired parts, and 2) disabling HT on worker nodes without spare DIMM parts (one way to tally these OOM holds is sketched below).
We see 60% of the cluster being used by analysis jobs. This might be caused by our recent reconfiguration of Condor and the gatekeeper to balance giving enough cores to COVID-19 jobs against reducing fragmentation of Condor cores. Too many analysis jobs seem to increase the job failure rate at the site.
Condor was updated to 8.8.8.
Hardware:
Retired 20 TB of usable space from dCache to harvest spare parts for the storage enclosures that are no longer under warranty.
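Where the batch system holds (rather than removes) jobs that exceed memory, one rough way to tally the OOM kills mentioned above is to query the schedd for held jobs whose hold reason mentions memory. A sketch assuming the HTCondor Python bindings; the string match is heuristic:

    import htcondor

    schedd = htcondor.Schedd()
    # JobStatus == 5 means Held; pass the attribute list positionally
    # so this works across binding versions.
    held = schedd.query("JobStatus == 5",
                        ["ClusterId", "ProcId", "RequestMemory",
                         "MemoryUsage", "HoldReason"])

    oom = [ad for ad in held
           if "memory" in str(ad.get("HoldReason", "")).lower()]
    print(len(oom), "held jobs mention memory in their HoldReason")
    for ad in oom[:10]:
        print(ad["ClusterId"], ad["ProcId"],
              "requested:", ad.get("RequestMemory"),
              "used:", ad.get("MemoryUsage"))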
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- Fixing storage issues at UC. Two of our older out-of-warranty servers have been having controller issues. Currently draining the pools that are still online and trying to recover data from the pools that are failing.
- The root disk on the UC gatekeeper filled up, causing job failures this morning (a simple guard like the sketch after this list would catch this earlier).
- NVIDIA drivers updated on the ML platform
- LOCALGROUPDISK filled up last Friday. Cleanup ongoing, now down to 97% full
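A trivial guard of the kind that would have flagged the gatekeeper's root disk before it filled; the threshold is arbitrary, and the check would run from cron or a monitoring probe:

    import shutil

    def check_disk(path="/", warn_frac=0.90):
        # shutil.disk_usage returns (total, used, free) in bytes.
        u = shutil.disk_usage(path)
        frac = u.used / u.total
        if frac >= warn_frac:
            print("WARNING: %s is %.0f%% full (%d GiB free)"
                  % (path, 100 * frac, u.free // 2**30))
        return frac

    check_disk("/")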
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
-
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA_SWT2:
- GGUS-Ticket-ID: #146691 concerns the SAM test for the Frontier setup. Only the test is affected; jobs are fine. The test probably needs to be updated.
- Ramping up OSG Covid-19 jobs
SWT2_CPB:
- GGUS-Ticket-ID: #146694 Same issue as seen above.
- GGUS-Ticket-ID: #146387 now closed.
- Met with networking staff for IPv6 discussions. They are evaluating options before committing to a timeline.
OU:
- Not much, all running well
- Upgraded XRootD to 4.11.3, which fixed space reporting and logging; we were then able to delete some old data from OU_OSCER_ATLAS_LOCALGROUPDISK (a sketch of listing cleanup candidates via the Rucio client follows below).
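A sketch of listing cleanup candidates on the RSE with the Rucio client, assuming a configured ATLAS Rucio environment; the returned field names are assumptions:

    from rucio.client import Client

    RSE = "OU_OSCER_ATLAS_LOCALGROUPDISK"
    client = Client()

    # Datasets resident on the RSE, oldest first, as deletion candidates.
    datasets = list(client.list_datasets_per_rse(RSE))
    datasets.sort(key=lambda d: str(d.get("created_at") or ""))
    for ds in datasets[:20]:
        print(ds.get("scope"), ds.get("name"), ds.get("created_at"))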
-
14:05
→
14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
ATLAS ML Platform & User Support 5m
Speaker: Ilija Vukotic (University of Chicago (US))
Running smoothly.
Opportunistic Folding@home work got us to sixth place.
-
14:20
→
14:40
WBS 2.3.5 Continuous Operations
Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Analytics Infrastructure & User Support 5m
Speaker: Ilija Vukotic (University of Chicago (US))
-
14:30
Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
Speakers: Andrew Hanushevsky (Stanford Linear Accelerator Center), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
Slowly ramping up with XCaches and VP.
AGLT2 - replaced their node with a new one with more storage. Changed them to direct access.
Prague - running smoothly. Will upgrade further this or next week.
LRZ - issue with the cleanup; the cache managed to cross the high-water mark (HWM). One way to watch for this is sketched below.
ROOT TChain bug discovered and fixed. Waiting for the LCG build to get it in production.
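For the watermark issue, one way to watch a cache from outside is to ask the server for its space statistics. A sketch assuming the XRootD Python bindings, with a hypothetical endpoint; XCache's own purge thresholds are the pfc.diskusage low/high watermarks in its configuration:

    from XRootD import client
    from XRootD.client.flags import QueryCode

    # Hypothetical XCache endpoint.
    fs = client.FileSystem("root://xcache.example.org:1094")

    # QueryCode.SPACE returns the server's space usage, which can be
    # compared against the configured high watermark.
    status, response = fs.query(QueryCode.SPACE, "/")
    if status.ok:
        print(response.decode())  # key=value pairs such as oss.used, oss.free
    else:
        print("query failed:", status.message)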
-
14:40
→
14:45
AOB 5m