US ATLAS Computing Facility
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
- 13:10 → 13:20
-
13:20
→
13:50
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
COVID-19 GPU contributions with the ML platform 15m
Speaker: Ilija Vukotic (University of Chicago (US))
-
13:35
Action Items: Communications/Site Monitoring (POSTPONED) 15m
Speaker: Fred Luehring (Indiana University (US))
-
13:35
→
13:40
WBS 2.3.1 Tier1 Center 5m
Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
- Finished the HTCondor upgrade to 8.8.8 on the CEs and the farm. The upgrade triggered a negotiator bug that starves mcore jobs. A workaround is in place now and production is back to its normal level.
- COVID-19 jobs have ramped up at BNL, 9k+ running jobs now, surpassing other sites in the past couple of days in OSG.
-
13:40
→
14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
1. Condor update:
The main goal is to update everything to 8.8.8 to address the security issue.
1.1) We started a big project to rebuild all 400 worker nodes of the condor cluster.
One motive for rebuilding was to separate the partition used by condor jobs from the tmp partition.
The rebuild also updates the nodes from 8.6.13 to 8.8.8/8.8.7-1.
We started with the UM site and finished rebuilding all of its worker nodes (179);
2/3 of the UM nodes are running 8.8.7-1 and 1/3 are running 8.8.8, depending on the rebuild day.
The WNs at MSU are now being rebuilt in batches: about 1/3 yesterday, another ~1/3 today,
and the rest Thursday and Friday.
1.2) Updated (switched) the condor head node from sl6/8.6.13 to sl7/8.8.8.
1.3) During the update of the main gatekeeper, we encountered a problem:
idle ucore jobs did not get scheduled to unclaimed cores.
This was solved by updating the head node to 8.8.8
and adding a workaround to the negotiator
(to address a possible bug in the negotiator in 8.8.8).
2. Job failures caused by the OOM killer.
These are very likely caused by:
a) high-memory pile-up jobs (a single job uses 56 GB of memory at peak)
running on our score queue (2 GB/core);
b) BOINC jobs also running at our site, which use extra memory on the worker nodes.
To address this issue, we stopped the BOINC jobs.
Now that we have solved the problem caused by the condor update,
it is a good time to monitor whether the same error still exists.
Fred has been in contact with ADC to ask whether it is possible
to put the pile-up jobs in a high-memory queue.
c) BOINC jobs will be suspended until we understand more about the situation.
3. AGLT2 started covid19 jobs last Wednesday.
We gave them a quota of up to 2000 cores; this can be expanded to 5000.
For now we do not see enough queued jobs for our site;
the average number of covid19 jobs we process is around 800.
4. Ticket 146371
Weird problem: a small set of files is accessible via xrootd but not gsiftp.
Restarting dcache on the pool node fixes it for a short time.
Shawn opened a ticket with dcache.
No resolution yet.
5. COVID19.
No change to access plan at UM or MSU.
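As a rough illustration of item 2 above, the mismatch between a 56 GB pile-up job and a 2 GB/core queue can be put into numbers. This is only a sketch: the two constants come from the report, while the node size, the number of co-running jobs, and the memory reserved for BOINC are hypothetical values chosen for the example, and the function names are not taken from any site tool.

```python
# Figures quoted in the AGLT2 report.
PEAK_JOB_MEMORY_GB = 56        # peak memory of a single pile-up job
QUEUE_MEMORY_PER_CORE_GB = 2   # memory allocation per core on the score queue


def cores_worth_of_memory(peak_gb: float, per_core_gb: float) -> float:
    """Return how many per-core memory allocations one job consumes at peak."""
    return peak_gb / per_core_gb


def fits_on_node(peak_gb: float, node_memory_gb: float, other_jobs: int,
                 per_core_gb: float, reserved_gb: float = 0.0) -> bool:
    """Return True if one heavy job, plus `other_jobs` jobs at their nominal
    per-core allocation, plus any reserved memory, stays within node memory."""
    total = peak_gb + other_jobs * per_core_gb + reserved_gb
    return total <= node_memory_gb


if __name__ == "__main__":
    # One pile-up job uses the memory budget of this many single-core slots.
    print(cores_worth_of_memory(PEAK_JOB_MEMORY_GB, QUEUE_MEMORY_PER_CORE_GB))

    # Hypothetical 32-core node with 64 GB of memory (2 GB/core):
    # the heavy job plus 31 ordinary jobs does not fit.
    print(fits_on_node(PEAK_JOB_MEMORY_GB, 64, 31, QUEUE_MEMORY_PER_CORE_GB))
```

Under these assumptions a single such job takes the memory share of 28 single-core slots, which is consistent with the OOM killer firing once the rest of the node is busy, and any extra memory used by BOINC only narrows the margin further.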
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- ICC PM today to apply GPFS client and network updates
- In the process of adding IPv6 to the UC workers. Workers are all configured. PTR records added Monday. Still need to add AAAA records
- Upgraded condor to 8.8.8-1.osg35
- Updated all workers to use the OSG rolling release
- Added COVID-19 job routes to MWT2 for running OSG COVID-19 jobs
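For the IPv6 item above (PTR records added, AAAA records still to come), a small offline check can list which workers have a reverse entry but no forward one yet. This is a minimal sketch under stated assumptions: the host names and addresses are hypothetical (taken from the 2001:db8::/32 documentation prefix), the records are passed in as plain dictionaries rather than queried from DNS, and the helper names are made up for the example. Only `ipaddress.ip_address(...).reverse_pointer` is a standard-library call.

```python
import ipaddress


def ptr_name(address: str) -> str:
    """Return the reverse-DNS name (ip6.arpa or in-addr.arpa) for an address."""
    return ipaddress.ip_address(address).reverse_pointer


def missing_aaaa(ptr_records: dict, aaaa_records: dict) -> list:
    """Return hosts that appear as PTR targets but have no AAAA record.

    ptr_records maps reverse names to host names; aaaa_records maps host
    names to IPv6 addresses. Both are hypothetical in-memory tables.
    """
    hosts_with_ptr = set(ptr_records.values())
    return sorted(host for host in hosts_with_ptr if host not in aaaa_records)


if __name__ == "__main__":
    # Hypothetical workers and addresses, for illustration only.
    workers = {
        "worker01.example.org": "2001:db8::1",
        "worker02.example.org": "2001:db8::2",
    }
    ptr_records = {ptr_name(addr): host for host, addr in workers.items()}
    aaaa_records = {"worker01.example.org": "2001:db8::1"}  # worker02 not added yet

    print(missing_aaaa(ptr_records, aaaa_records))
```

Keeping the check on in-memory tables makes it runnable without network access; a real audit would fill the two dictionaries from the site's DNS zone data.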
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
Smooth operations except for high temp alarms due to broken fans. Replacement fans ordered.
Site was not getting filled by PanDA for a few weeks, but it's better now.
Two more NESE gateways added anticipating ramping up. Working as NESE_DATADISK in AGIS & Rucio.
6PB NESE upgrade arrived, installed, tested, but switches from DELL have been delayed twice.
Converging on NET2/NESE tape Tier. Getting helpful feedback from BNL and others in HEP.
Fred noticed that a few of our oldest nodes were getting a ~50% failure rate, strangely, from stage-out timeouts. The problem quickly disappeared, but we haven't yet figured out what the cause was.
User complained about 5 missing files at NET2_DATADISK. They were indeed missing, marked as gone by DDM. Don't think it's related to a local issue.
-
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA_SWT2:
- Running well
- Offered cores to the OSG VO for COVID-19 jobs. Awaiting response.
SWT2_CPB:
- Running well
- Issues with GGUS ticket 146387 are no longer occurring but waiting to see if they come back with different job mix
- Issue found with latest gratia slurm probe. Trying to get corrected data in gracc to forward to APEL.
- Backup generator test for facility is scheduled for tomorrow night.
OU:
- Nothing to report, all running well.
- 14:00 → 14:05
-
14:05
→
14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
ATLAS ML Platform & User Support 5m
Speaker: Ilija Vukotic (University of Chicago (US))
-
14:20
→
14:40
WBS 2.3.5 Continuous Operations
Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Analytics Infrastructure & User Support 5m
Speaker: Ilija Vukotic (University of Chicago (US))
-
14:30
Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Andrew Hanushevsky (Unknown), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:40
→
14:45
AOB 5m