US ATLAS Computing Facility
13:00
→
13:10
WBS 2.3 Facility Management News 10m. Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
Topical presentations, https://docs.google.com/document/d/1NIc67p3AB2RkYjJsP6Nx_lwPXFX03w1n2SFOgCU47ro/edit
Reminder to update http://bit.ly/usatlas-capacity with new procurements and to inform Shawn.
Meetings/workshop at FNAL next week:
- GDB (9/10-11): https://indico.fnal.gov/event/21232/
- pre-GDB (9/1): https://indico.cern.ch/event/739896/
- FIM4R (9/12): https://indico.cern.ch/event/739896/
-
13:10
→
13:20
OSG-LHC 10m. Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
OSG 3.5
- 3.5.0 released last Friday: https://opensciencegrid.org/docs/release/3.5/release-3-5-0/
- HTCondor-CE excluded from 3.5.0 as we're expecting a new major release that adds token support
OSG 3.4
- 3.4.34 released last Thursday: https://opensciencegrid.org/docs/release/3.4/release-3-4-34/
- HTCondor 8.8.4 available in testing
ATLAS XCache
3.5.0 and 3.4.34 include ATLAS XCache RPMs based on Ilija's configuration. Our RPM doesn't reflect the configuration of the XCaches at BNL, SLAC, etc.
-
13:20
→
14:00
Topical Report
-
13:20
Efficiency of CPU 15m. Speaker: Fred Luehring (Indiana University (US))
-
13:40
→
14:25
US Cloud Status
-
13:40
US Cloud Operations Summary 5m. Speaker: Mark Sosebee (University of Texas at Arlington (US))
- 13:45
-
13:50
AGLT2 5m. Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
Service:
1) Running smoothly; no new incidents or tickets.
2) Following up on the jobs that fail with SIGSEGV errors: still an average of 20 such jobs per day. We plan to remove the local installation of the gfal libraries.
3) Working on integrating more of the site's service monitoring into check_mk.
Hardware:
1) Replaced 2 dCache database replication servers with newer hardware (Dell R610 and R710 nodes).
2) Placed an order for 3 Dell R740xd storage nodes for Tier2 usage.
- 13:55
-
14:00
NET2 5m. Speaker: Prof. Saul Youssef (Boston University (US))
Minor problem with GPFS getting wedged by PanDA jobs with many inputs.
Smooth operations otherwise.
Lots of NESE work happening. Setting up Globus infrastructure for endpoints.
Will probably buy a couple more gateways for NET2 traffic to and from NESE.
Massive expansions happening at MGHPCC:
1. New Harvard CANNON cluster: 100k x86 cores, 40 PB storage, >1M CUDA cores
2. New $12M MIT/IBM cluster
3. MIT Supercloud expansion: 450 nodes, each with 2 CPUs, 2 NVIDIA GPUs, and lots of RAM
-
14:05
SWT2 5m. Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- No problems, all sites running well
- We were slowly draining over the weekend, which seemed to be related to Condor-CE losing track of jobs. We restarted Condor-CE and cleaned out all spool files, which caused all running jobs to fail, but things now look much better and we're full again.
UTA:
1) SLATE node is installed. Still need to finalize some configuration steps.
2) Investigating some event index job failures at SWT2_CPB. Some of these were related to a storage issue over the weekend (that was fixed), but not all.
3) Planning hardware deployment from our most recent purchase.
4) Backup A/C unit being installed this week in the SWT2_CPB machine room.
- 14:10
-
14:15
Analysis Facilities - SLAC 5m. Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
- 14:20
-
14:25
→
14:30
AOB 5m