US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2019-01-30T13:00:00-05:00
End: 2019-01-30T15:00:00-05:00
Location: No location set

Wednesday 30 Jan 2019, 13:00 → 15:00 US/Eastern

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
  
  Google slides
  
  Attending: Ilija, Lincoln, Saul, Rob, Horst, Xin, Mark, Armen, Patrick, Wei, Wenjing, Brian, Ofer, Brian L, William, Fred
  
  Apologies: Eric, Doug, Shawn
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  3.4.23 (Released 2019-01-23)
  
  Singularity 3.0.2 (upcoming)
  
  HTCondor 8.8.0 (upcoming): Note changes in job router matching
  
  3.4.24
  
  XRootD 4.9.0 RC4 just released upstream
  
  Singularity 3.0.3 (upcoming)
  
  Other Projects
  
  Base XCache docker image pushed to Docker Hub. Still working on the ATLAS XCache implementation.
  
  Updated suggested account for supporting opportunistic ATLAS jobs (documentation)
- 13:20 → 13:40
  Topical Report
  - 13:20
    
    WBS 2.3.5 Continuous Integration & Operations (CIOPS) 10m
    
    Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Lincoln Bryant
    
    WBS 2.3.5 CIOPS organization
    
    XCache Deployment Milestone
    
    Wei: "high availability"? It's only a cache... you can lose the data, no problem. And you can have multiple caches to back it up. Worried about perception of HA term.
    
    Ilija: if we go for a model where all sites have these caches, it will become an important service. Updates, new features, want to refresh the site. Want service to come back quickly.
    
    Wei: reboots should be okay. And you might have a backup xcache anyway. It should be flexible.
    
    Rob: Understood. We need a better term.
    
    Wei: Page 4: concerns about stability goals, and what's possible for access via cache or direct to the origin.
    
    Xin: where should it be located within the site? Ans: close to compute.
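
The fallback behavior discussed above (a cache that may reboot, an optional backup cache, and direct access to the origin) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not XCache or XRootD code: `read_with_fallback` and the three reader functions are hypothetical names introduced only for this example.

```python
from typing import Callable, Iterable

# Hypothetical type: a reader takes a file name and returns its bytes,
# or raises OSError when the service is unavailable.
Reader = Callable[[str], bytes]


def read_with_fallback(name: str, caches: Iterable[Reader], origin: Reader) -> bytes:
    """Try each cache in order; read directly from the origin if all fail."""
    for cache in caches:
        try:
            return cache(name)
        except OSError:
            # Cache is down or rebooting. No data is lost, so try the next source.
            continue
    return origin(name)


# Stand-in readers for illustration only (assumptions, not real services).
def down_cache(name: str) -> bytes:
    raise OSError("cache is rebooting")


def backup_cache(name: str) -> bytes:
    return b"from backup cache"


def origin(name: str) -> bytes:
    return b"from origin"


if __name__ == "__main__":
    print(read_with_fallback("data.root", [down_cache, backup_cache], origin))
    print(read_with_fallback("data.root", [down_cache], origin))
```

In this sketch a cache outage only changes where the data is read from, which is the point raised in the discussion about the "high availability" term.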
- 13:40 → 14:25
  US Cloud Status
  - 13:40
    
    US Cloud Operations Summary 5m
    
    Speaker: Mark Sosebee (University of Texas at Arlington (US))
    
    US-cloud-summary-1_23_19.pdf
    
    US-cloud-summary-1_30_19.pdf
  - 13:45
    BNL 5m
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    
    dCache upgrade (v3.0 to v4.2) done on 01/22
    
    NFS4.1 interface not working after the upgrade, under investigation with dCache developers. Affected local users.
    
    CEs are all updated to HTCondor-CE version 3.2.0
    
    CentOS7 migration
    
    moving to native SL7 hosts from local containers in March (probably combined with UCORE migration)
    
    SCRATCHDISK space
    
    1.5PB. Long-standing issue with slow deletion.
    
    ADC suggesting to reduce size by 1PB (move to DATADISK). Under discussion.
    
    IPv6
    
    Done. SE is dual stack.
  - 13:50
    
    AGLT2 5m
    
    Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    Services:
    
    Services are running smoothly, with no incidents during the past 2 weeks.
    
    High load on Condor worker nodes occurred only once, on a single worker node, in the past 2 weeks, much less frequently than before.
    
    Hardware:
    
    Retired a Dell M610 blade to make space for the new worker nodes (9 Dell C6420 worker nodes, each with 56 HT CPUs, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz). The new worker nodes are still being brought online.
  - 13:55
    MWT2 5m
    
    Speaker: Lincoln Bryant (University of Chicago (US))
    
    Equipment orders for UC have been submitted.
    
    A network downtime has been scheduled for February 6.
    
    Facility milestones - to be updated for next time, for all three sites (UC, IU, UIUC)
    
    CentOS7 migration
    
    SCRATCHDISK space
    
    IPv6
  - 14:00
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Transitioned to single UCORE queue for NET2.
    
    Networking from NET2 to NESE at 2 x 100G working. Testing NESE as an ATLAS DDM endpoint to follow.
    
    On deck....
    
    Preparing to purchase worker nodes, probably more C6420s.
    
    Finish retiring old Harvard Tier 3
    
    Finish switching from custom LSM to rucio (we got somewhat stuck on this because of a mysterious Globus-related error in PanDA).
    
    Buy & install SLATE node
    
    Migration to SL7
    
    IPv6
    
    Smooth operations with full site otherwise.
  - 14:05
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    OU:
    
    - Not much to report, operating smoothly
    
    - Updated squid configuration at all sites
    
    - Scheduled OSCER maintenance today; it should be transparent to PanDA, as jobs will just be held (queued) in SLURM
    
    UTA:
    
    Updated Squid configuration at both sites.
    
    Low level deletion issue observed at SWT2_CPB (hard to replicate)
    
    There will be a short power outage on 1/4 power feeds at UTA_SWT2 on Monday morning. We expect that this will only affect some compute nodes.
  - 14:10
    
    HPC Operations 5m
    
    Speaker: Doug Benjamin (Duke University (US))
    
    USHPC30Jan.png
    
    Here is the production at US HPCs for the past 14 days, attached as an image to the agenda.
    
    We have exhausted our allocation at OLCF and are now in the over-burn period.
    
    Kibana at Chicago reports a different number of events than BigPanda monitoring. Jira ticket: https://its.cern.ch/jira/browse/ATLASES-68
  - 14:15
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:20
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    
    Nothing to report
- 14:25 → 14:30
  
  AOB 5m
