US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2021-08-18T13:00:00-04:00
End: 2021-08-18T14:45:00-04:00
Location: No location set

Wednesday 18 Aug 2021, 13:00 → 14:45 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  OSG 3.5 enters critical/bug fix only support at the end of this month: https://opensciencegrid.org/technology/policy/release-series/#life-cycle-dates.
  
  Now is the time to start updating HTCondor-CEs to OSG 3.5 upcoming so that they are ready to accept tokens.
  
  AGLT2
  gate01.aglt2.org
  gate02.grid.umich.edu
  gate03.aglt2.org
  
  BNL
  gridgk01.racf.bnl.gov
  gridgk02.racf.bnl.gov
  gridgk03.racf.bnl.gov
  gridgk04.racf.bnl.gov
  gridgk06.racf.bnl.gov
  gridgk07.racf.bnl.gov
  gridgk08.racf.bnl.gov
  
  BU
  atlas-ce.bu.edu
  
  MWT2
  iut2-gk.mwt2.org
  uct2-gk.mwt2.org
  
  SWT2
  gk01.atlas-swt2.org
  tier2-01.ochep.ou.edu
  
  OSG will host another token hackathon this month – admins are free to come with questions, for help updating their CEs, etc.
  
  Pre-GDB token workshop in October (11-12?)
- 13:20 → 13:35
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    TBD 10m
- 13:35 → 13:40
  
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  
  Writing to tape from ATLAS has been paused so that we can reduce the number of tapes used for writing. We have reduced the number of tape drives used for writing to 4 (each capable of writing 200 MB/s). With these we can write 67 TB a day to tape, and we have almost 400 TB to write from the internal HPSS disk cache to tape. We cannot reduce the number of drives any further.
  
  We are having a problem staging files from tape to dCache disks. dCache is not pulling files from the HPSS cache and we are seeing a lot of bad dCache restore requests. We are investigating and actively trying to clear up the situation so that data can flow.
  
  This problem was triggered by a large request (>200k files) at mid-day on Saturday. The exact source of the request is under investigation.
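As a rough cross-check of the tape-writing figures above, the short sketch below computes the theoretical peak write rate for 4 drives at 200 MB/s and the time needed to clear a 400 TB backlog at the quoted 67 TB/day. It assumes decimal units (1 TB = 10^6 MB) and drives streaming continuously, neither of which is stated in the notes; the function name is illustrative only.

```python
# Back-of-the-envelope check of the tape write figures from the Tier1 report.
DRIVES = 4                 # drives currently used for writing (from the notes)
MB_PER_S_PER_DRIVE = 200   # per-drive write rate in MB/s (from the notes)
BACKLOG_TB = 400           # approximate backlog in the HPSS disk cache (from the notes)
QUOTED_TB_PER_DAY = 67     # daily write rate quoted in the notes

SECONDS_PER_DAY = 24 * 60 * 60


def peak_tb_per_day(drives, mb_per_s):
    """Theoretical peak in TB/day, assuming 1 TB = 1e6 MB and continuous streaming."""
    return drives * mb_per_s * SECONDS_PER_DAY / 1e6


peak = peak_tb_per_day(DRIVES, MB_PER_S_PER_DRIVE)
days_to_clear = BACKLOG_TB / QUOTED_TB_PER_DAY

print(f"theoretical peak: {peak:.1f} TB/day")
print(f"days to clear backlog at quoted rate: {days_to_clear:.1f}")
```

Under these assumptions the theoretical peak comes out slightly above the quoted 67 TB/day, and the backlog corresponds to roughly six days of writing if no new data arrives.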
- 13:40 → 14:00
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  N_Jobs-20210818.png
  
  Success-20210818.png
  
  Transfer-20210818.png
  Good running with scheduled downtimes at AGLT2, MWT2, and NET2.
  
  A few mysterious periods of draining that I will ask the relevant sites about.
  
  Nearing (I think nearing!) getting final quotes for the FY21 purchases.
  
  Dell will raise the prices on Sep 1 and this means we need POs by Aug 31.
  
  Prices are up - way up.
  
  Almost no choice in CPUs: most are back-ordered to ~January 2022. For compute servers the only viable AMD processor is the EPYC 7302 (16C/32T @ 3.0 GHz), which forces either 2 GB or 4 GB per thread. (Last year we used the EPYC 7402 (24C/48T @ 2.8 GHz), which yielded 2.7 GB per thread.)
  
  I need to know how many compute servers each site wants and how much money each site has to spend on compute servers.
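The memory-per-thread figures quoted for the two EPYC options can be reproduced with the small sketch below. The RAM sizes of 64 GB and 128 GB per socket are an assumption chosen to match the quoted ratios; the notes give only the per-thread numbers, not the actual memory configurations.

```python
# Memory per hardware thread for the CPU options mentioned in the notes.
# ASSUMPTION: RAM per socket is 64 GB or 128 GB; the notes do not state this.

def gb_per_thread(ram_gb, threads):
    """Return GB of RAM available per hardware thread."""
    return ram_gb / threads


EPYC_7302_THREADS = 32  # 16C/32T (from the notes)
EPYC_7402_THREADS = 48  # 24C/48T (from the notes)

options = {
    "EPYC 7302, 64 GB": gb_per_thread(64, EPYC_7302_THREADS),
    "EPYC 7302, 128 GB": gb_per_thread(128, EPYC_7302_THREADS),
    "EPYC 7402, 128 GB": gb_per_thread(128, EPYC_7402_THREADS),
}

for name, value in options.items():
    print(f"{name}: {value:.1f} GB/thread")
```

With these assumed sizes the 32-thread part lands on exactly 2 GB or 4 GB per thread, while the 48-thread part used last year gives about 2.7 GB per thread, which is why no intermediate value is available this year.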
  - 13:40
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
    
    MSU
    
    - last of 4 waves done 11-Aug-2021, mostly MSU T3, thus finishing equipment move from dept bldg to data center.
    
    - purchase planning: 3x ESXi hosts + 1x NVMe storage + ~3 dcache storage nodes + as many 1U WNs as budget allows.
    
    - FYI@MWT2: we may be decommissioning the EX9208 this week, to be confirmed.
    
    UM:
    
    - draining gate03 for the condor-ce update. (We lost 1000 jobs when we updated gate01 with jobs still running, so to be cautious we now drain the gatekeeper first.)
    
    - patched a bunch of nodes with ipv6 issues (adding ipv6 neigh rules manually).
    
    - did 2 condor updates (8.4.13->8.4.14->8.4.15)
    
    - rebuilt all Tier2 WN with CentOS7
    
    - finished rebooting all nodes to the new 1160.36 kernel
  - 13:45
    
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    Network downtime at UChicago this past Monday. A line card was replaced on our EX9208 in preparation for the move to our new datacenter.
    
    Elasticsearch upgraded to 7.14
    
    Continuing to work on migrating our squid traffic onto the new SLATE squids
    
    Discussing purchasing for new compute
  - 13:50
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    MGHPCC scheduled maintenance.
    
    NESE_DATADISK was down for an additional day for Harvard re-networking.
    
    xrootd is working now, passing smoke tests, HTTP-TPC.
    
    High priority items:
    
    1) Prepare for worker node purchase
    2) Expand xrd cluster, switch over from gridftp to xrootd
    3) OSG 3.5 update
    4) Finish IPv6
  - 13:55
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    SWT2-running-last-week.png
    
    OU:
    
    - Nothing to report, running well.
    
    - Upgraded xrootd proxy (se1) to 5.3.1, seems to run well.
    
    UTA:
    
    - Preparing to install the compute nodes + storage from our last purchase. Logistically this will allow us to move forward with the move / retirement of UTA_SWT2.
    
    - About to install the latest version of XrootD on the HTTP-TPC test instance. Need to verify the ROCKS recipe for building the host as a final step prior to production deployment.
    
    - Need to schedule a downtime to install the LAN networking upgrade hardware. Many needed software updates will occur during this outage.
    
    - Recent operations generally stable, smooth.
- 14:00 → 14:05
  
  WBS 2.3.3 HPC Operations 5m
  
  Speaker: Lincoln Bryant (University of Chicago (US))
- 14:05 → 14:20
  WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    
    NTR
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    
    A DTN with 100 Gbps is online
  - 14:15
    Analysis Facilities - Chicago 5m
    
    Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
    
    GPU arrived, not yet set up. Otherwise nothing to report on hardware side.
- 14:20 → 14:40
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  XRootd checksum issue at BU is resolved, running smoke tests successfully
  
  F-S DevOps mtg
  
  Fred has draft of installation documentation from installing test server at IU
  
  Jess deploying new server at Illinois (will further refine documentation)
  
  Ilija tracked old squid usage to OSG jobs using CVMFS
  - 14:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-8_11_21.pdf
    
    US-cloud-summary-8_18_21.pdf
  - 14:25
    Service Development & Deployment 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    ES upgraded yesterday to 7.14. Changing its monitoring and preparing it for the big upgrade (8.0). Manually indexing missed data.
    
    XCache - all up and running. Now SLATE xcaches auto-configured through GitOps.
    
    VP - all working fine. Working on Rucio integrations. Rucio Oracle upgrade only in September.
    
    Squids - all working fine. Preparing things for retirement of "old" squids.
    
    Rucio GeoIP still not tested due to unrelated changes that make it incompatible with Python 2.7. Now at version 1.26.2.
  - 14:30
    
    Kubernetes R&D at UTA 5m
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    Information on status with hardware move to CPB in Mark's report.
- 14:40 → 14:45
  
  AOB 5m
