US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2019-09-04T13:00:00-04:00
End: 2019-09-04T15:00:00-04:00

Wednesday 4 Sept 2019, 13:00 → 15:00 US/Eastern

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
  
  Topical presentations, https://docs.google.com/document/d/1NIc67p3AB2RkYjJsP6Nx_lwPXFX03w1n2SFOgCU47ro/edit
  
  Reminder to update http://bit.ly/usatlas-capacity with new procurements and to inform Shawn.
  
  Meetings/workshop at FNAL next week:
  
  - GDB (9/10-11): https://indico.fnal.gov/event/21232/
  
  - pre-GDB (9/1): https://indico.cern.ch/event/739896/
  
  - FIM4R (9/12): https://indico.cern.ch/event/739896/
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  OSG 3.5
  
  3.5.0 released last Friday: https://opensciencegrid.org/docs/release/3.5/release-3-5-0/
  
  HTCondor-CE excluded from 3.5.0 as we're expecting a new major release that adds token support
  
  OSG 3.4
  
  3.4.34 released last Thursday: https://opensciencegrid.org/docs/release/3.4/release-3-4-34/
  
  HTCondor 8.8.4 available in testing
  
  ATLAS XCache
  
  3.5.0/3.4.34 included ATLAS XCache RPMs based on Ilija's configuration. Our RPM does not reflect the configurations of the XCaches at BNL, SLAC, etc.
- 13:20 → 14:00
  Topical Report
  - 13:20
    
    Efficiency of CPU 15m
    
    Speaker: Fred Luehring (Indiana University (US))
    
    US_ATLAS_Accounting_20190904.pdf
- 13:40 → 14:25
  US Cloud Status
  - 13:40
    
    US Cloud Operations Summary 5m
    
    Speaker: Mark Sosebee (University of Texas at Arlington (US))
    
    US-cloud-summary-8_28_19.pdf
    
    US-cloud-summary-9_4_19.pdf
  - 13:45
    BNL 5m
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    
    Newly purchased compute nodes will be delivered this Thursday
    
    97 Supermicro AS-1023US-TR4 nodes
    
    Instability of the dCache Chimera name server
    
    Solved by adding an additional name server host
  - 13:50
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    Service:
    
    1) Running smoothly, no new incidents/tickets
    
    2) Following up on the jobs that failed with SIGSEGV errors: still seeing an average of 20 such jobs per day. Plan to remove the local installation of the gfal libraries.
    
    3) Working on integrating more of the site's service monitoring into check_mk
    
    Hardware
    
    1) Replaced 2 dCache database replication servers with newer hardware (Dell R610 and R710 nodes)
    
    2) Placed an order for 3 Dell R740xd storage nodes for Tier2 usage
  - 13:55
    MWT2 5m
    
    Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
    
    Network instability between UC and IU due to a flaky 100G interface
    
    Began 28 August, has been recurring since then
    
    UChicago network engineers are working on troubleshooting
  - 14:00
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Minor problem with GPFS getting wedged by PanDA jobs with many inputs.
    
    Smooth operations otherwise.
    
    Lots of NESE work happening. Setting up Globus infrastructure for endpoints.
    
    Will probably buy a couple more gateways for NET2 traffic to and from NESE.
    
    Massive expansions happening at MGHPCC:
    
    1. New Harvard CANNON cluster: 100k x86 cores, 40PB storage, >1M CUDA cores
    
    2. $12M new MIT/IBM cluster
    
    3. MIT Supercloud expansion: 450 nodes, each with 2 CPUs, 2 NVIDIA GPUs, lots of RAM
  - 14:05
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    OU:
    
    - No problems, all sites running well
    
    - Sites were slowly draining over the weekend, which seemed to be related to Condor-CE losing track of jobs. We restarted Condor-CE and cleaned out all spool files, which caused all currently running jobs to fail, but things now look much better and we're full again.
    
    UTA:
    
    1) SLATE node is installed. Still need to finalize some configuration steps.
    
    2) Investigating some event index job failures at SWT2_CPB. Some of these were related to a storage issue over the weekend (that was fixed), but not all.
    
    3) Planning hardware deployment from our most recent purchase.
    
    4) Backup A/C unit being installed this week in the SWT2_CPB machine room.
  - 14:10
    
    HPC Operations 5m
    
    Speaker: Doug Benjamin (Duke University (US))
    
    Nothing significant to report. Reports are due for the ALCC allocations that we have not used yet.
    
    Need to recompile mpi4py at NERSC and test new container before we can resume running there.
  - 14:15
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:20
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    
    Nothing to report.
- 14:25 → 14:30
  
  AOB 5m
