US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (this week)

      • OSG 3.5-upcoming
        • XRootD 5.3.4 (fixes for data origins)
        • HTCondor 9.0.8 (pending testing, available in osg-upcoming-testing)
      • OSG 3.6
        • XRootD 5.3.4 (fixes for data origins)
        • HTCondor 9.0.8 with working proxy delegation (pending testing, available in osg-testing)
        • CVMFS 2.9.0 (pending testing, available in osg-testing)
        • oidc-agent 4.2.4 (requires a restart of the agent)
      • OSG 3.6-upcoming
        • HTCondor 9.4.0 (pending testing, available in osg-upcoming-testing)

      Miscellaneous

      • How's testing of XRootD in 3.6 going?
      • Site plans for EL8 vs EL9?
      • HTCondor-CE updates to support tokens
        • Known issue with C-style comments outside of routes in JOB_ROUTER_ENTRIES (thanks for the report, Wenjing!): https://opensciencegrid.org/docs/release/notes/#known-issues (see the configuration sketch after this list)
        • CEs on token-supporting versions of HTCondor-CE
          • gate01.aglt2.org
          • gate02.grid.umich.edu
          • gate04.aglt2.org
          • gridgk05.racf.bnl.gov
          • iut2-gk.mwt2.org
          • osg-gk.mwt2.org
        • CEs on old versions of HTCondor-CE
          • atlas-ce.bu.edu
          • bgk01.sdcc.bnl.gov
          • bgk02.sdcc.bnl.gov
          • gk01.atlas-swt2.org
          • gk04.swt2.uta.edu
          • grid1.oscer.ou.edu
          • gridgk01.racf.bnl.gov
          • gridgk02.racf.bnl.gov
          • gridgk03.racf.bnl.gov
          • gridgk04.racf.bnl.gov
          • gridgk06.racf.bnl.gov
          • gridgk07.racf.bnl.gov
          • gridgk08.racf.bnl.gov
          • mwt2-gk.campuscluster.illinois.edu
          • ouhep0.nhn.ou.edu
          • spce01.sdcc.bnl.gov
          • tier2-01.ochep.ou.edu
          • uct2-gk.mwt2.org
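
      A minimal sketch of the JOB_ROUTER_ENTRIES pattern covered by the known issue above (hypothetical file path, route name, and attributes; this is only an illustration, not any site's actual configuration):

          # Hypothetical HTCondor-CE snippet, e.g. in /etc/condor-ce/config.d/
          JOB_ROUTER_ENTRIES @=jre
              /* A C-style comment placed here, OUTSIDE the [ ... ] route ClassAd,
                 is the pattern reported to trigger the known issue. */
              [
                  name = "Route_Example";
                  TargetUniverse = 5;
              ]
          @jre
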
    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 30m
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      As will likely be noted by others, HammerCloud (HC) turned off BNL for several hours because of an expired user proxy.

      Puppetizing a stand-alone XRootD server to switch over from GridFTP for BNLHPC_DATADISK and BNLHPC_SCRATCHDISK.

      Starting the FY22 procurement process already; storage is going first.

      Commissioning of the new tape libraries is continuing. Initial throughput measurements (network, disk, and tape) have been made and look good in these synthetic tests.

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • The last week has been relatively rocky, as CERN had a couple of outages, though the week before was good.
      • Could each site say a little about how it will spend its remaining funding?
      • We need to finish the IPv6 rollout at NET2 and CPB.
      • What is the status of XRootD and TPC? It seems sites are still suffering occasional server hangs.
      • Caught an issue where the ATLAS and WLCG storage information was not being properly synchronized.
      • The remaining sites not at OSG 3.5 need to get onto it so we can make the move to 3.6 in the first quarter of next year.
      • MWT2 is working on upgrading HTCondor/HTCondor-CE while also juggling the move of the UC cluster to a new location.
      • Ofer will say something about upgrading services to get ready for Run 3.
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)


          dCache

            11/24/2021: dCache pool umfs06_12 caused jobs to fail at staging in files; restarting the dCache service resolved the problem.

            12/05/2021: dCache pool umfs23_1 caused jobs to fail at staging in files; an xfs_repair was needed to resolve the problem.

            11/30/2021: Updated dCache from 6.2.32 to 6.2.35 to fix the SRR report issue. The update was smooth.
                        We also updated the firmware and kernel and rebooted. The R740xd2 had a new BIOS installed (2.12.2).

          Condor

            12/02/2021: We spotted jobs nearly flooding one worker node that has small disk per core (14 GB),
                        so we changed the max disk from 15 GB to 13 GB/core for the AGLT2 PanDA queue;
                        this will stop reconstruction jobs from coming in.
                        This is likely caused by a bug in HTCondor 9.0.6 (it schedules jobs to worker nodes with insufficient disk space).
                        ADC also mentioned they could work on reducing the intermediate file sizes of the reconstruction jobs.

            12/06/2021: Did a rolling upgrade of HTCondor from 9.0.6 to 9.0.8 to address a bug
                        (Condor sends jobs to worker nodes with insufficient disk space).
                        The upgrade went smoothly. We first did the worker nodes without draining,
                        which requires setting a longer SHUTDOWN_GRACEFUL_TIMEOUT (3 days)
                        so that all remaining jobs can finish before condor restarts the startd after the upgrade;
                        the condor_master itself does not get restarted.
                        Then we did the sched nodes and head nodes and restarted the condor service after upgrading.
                        A configuration sketch is given below.
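
            A minimal sketch of the knob mentioned above (the 259200-second value is an assumption derived from the quoted 3 days; the actual AGLT2 configuration was not shown in the report):

                # Hypothetical worker-node HTCondor configuration fragment (illustration only).
                # 3 days = 3 * 24 * 3600 = 259200 seconds; the default graceful-shutdown window
                # is much shorter, so long-running jobs would otherwise be interrupted on restart.
                SHUTDOWN_GRACEFUL_TIMEOUT = 259200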

          Network

            12/04/2021: From Sat 12/04 1 PM to Sun 12/05 3:30 AM (13 hours) we lost the hard link between the UM and MSU sites.
                        This was due to a hardware issue in the Merit service provider's equipment;
                        a DWDM (dense wavelength-division multiplexing) optical card in East Lansing was replaced.
                        During the outage the MSU site lost its path to non-ESnet routes, including the Merit DNS resolvers,
                        but it now has ACL access to the MSU DNS resolvers.

          Hardware

            MSU & UM are working on common quotes for R740xd2 and R6525 servers with currently available AMD CPUs,
            planning for roughly a 50/50 storage/compute split.

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        UC:

        • A brief power outage last week brought the site down for about half a day while we recovered.
        • Temperature issues at the new data center took down one of our storage nodes there; it was quickly recovered.
        • First physical machine move was Monday 12/6. Working on bringing the moved machines back into production.
        • Dell is working on a new quote with a different CPU to get us our compute faster. All the other servers in the Dell PO have arrived.
        • Also discussing with Dell a storage purchase with our remaining funds.
        • Reverted SRR back to the old Ruby space script due to empty storage shares. Will work on it next week.

        IU:

        • Dell servers arrived. They are racked and in the process of being built.
        • Remaining funds will go toward compute.

        UIUC (for the newly installed 24 compute nodes):

        • Benchmarking ETA: this week
        • Production ETA: the next week or two
        • Remaining funds will go toward compute
      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef

         

        Preparing to retire the 3 TB pool (700 TB) at roughly the end of the year.

        Planning updates/purchases/preparation for Run 3, including IPv6, networking upgrades to NESE, additional NESE Ceph storage and gateways, and the UMass expansion.

        NESE Tape is nearing operations.

      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Overall, running well.

        - Still seeing occasional XRootD hangups; no longer on the proxy gateway since we upgraded to the latest 5.3.4-rc2 version, but still on the storage nodes, which are still running 4.12.1. An XRootD restart on the storage node usually fixes that.

        - Will upgrade the backend storage to 5.3.x as soon as a stable 5.3.x version is tested and available in the osg-upcoming repo.

        UTA:

        • IPv6 testing is progressing. We are installing the PS nodes today and will verify routing and run tests within the mesh of contacts. If all works well, we'll implement IPv6 on the front-end nodes of the cluster.
        • UTA_SWT2 had a problem saturating the inbound pipe with input data from SWT2_CPB. We have reduced the MAX I/O parameter in CRIC to get an easier job mix that can be supported.
        • The logistics of moving UTA_SWT2 assets have gotten complicated. We will know more after Jan 1.
        • We are still seeing issues with the WebDAV door. We have rebuilt the existing GridFTP servers to include WebDAV access and will move this into production shortly after adding it to CRIC. The intention for now is to move read/write operations to the new service while leaving deletions on the existing WebDAV host. After DNS changes and a new certificate, we'll run all three hosts under the single name gridftp.atlas-swt2.org.
    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
      • TACC
        • Offline for 'Texascale days' until Dec 11
          • Only the largest jobs can run during this period, from at least half the system size up to the full system size
          • It would be fun to apply for the next one!
        • Down to 85,000 SUs (22% of the allocation); on track to use them by the end of March
      • NERSC
        • Cori running fine
        • 15M hours remain; these need to be used in the next ~30 days
        • Working on Perlmutter; mcprod has assigned a task for us to play with
      • Looking into ways to set up more alarming/alerting
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        24-hour power outage today. Users have been warned.

        The GPU order went to the vendor.

      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        The UC AF is working fine.

        Successfully ran analysis on the UC AF while using ServiceX deployed on NCSA Fabric nodes.

    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • First pass at site tasks for Run 3 (see below)
      • F-S DevOps
        • SLATE node kernel updated at BU; Augustine and Saul added to FedOps email list
        • MWT2 squid moved (generated GGUS ticket)
        • Patrick/Horst added to atlas-squid group
        • Automated email deployment held up due to a MailGun/GitHub issue
      • ATLAS-WLCG CRIC syncing problem pointed out by Fred; fixed
      • Status of SWT2 decommissioning?

       

      Site Tasks in Preparation for Run 3

      Top Priority
          •    Confirm that SRR is working reliably and remove any SRM fallback for TPC
          •    Update site CEs to HTCondor-CE 5.1.1 or higher (Done: AGLT2, In process: BNL)
          •    Update site batch to HTCondor 9 (Done: AGLT2, BNL)
          •    Update sites to OSG 3.5 (NET2, SWT2)
          •    Site support for IPv6 (NET2, SWT2)
          •    Switch SWT2 from LSM to Rucio Mover

      Next Highest Priority
          •    Update sites to OSG 3.6
          •    Update to dCache 7 series (Done: BNL)
          •    Update SGE (NET2) and Slurm (SWT2)?

      Link to Facility Services Spreadsheet

      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
        • XCaches in SLATE
          • Two MWT2 XCaches are down: one is in transit, and the other one is behaving strangely and will be inspected today.
          • Everything else works fine.
        • VP
          • Working fine.
          • BNL VP queue still has no resources behind it.
        • ServiceX
          • Working fine.
          • Many improvements have been developed and are waiting to be put into production.
        • Rucio
          • VP integration ongoing (slowly).
      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        While waiting for the decommissioned UTA_SWT2 hardware, which will be used for a Kubernetes cluster at CPB, we are working on a faster option: starting with a smaller set of machines before the main chunk of UTA_SWT2 hardware arrives.

    • 14:55 15:05
      AOB 10m