US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (this week)

      • OSG 3.5-upcoming
        • XRootD 5.3.4 (fixes for data origins)
        • HTCondor 9.0.8 (pending testing, available in osg-upcoming-testing)
      • OSG 3.6
        • XRootD 5.3.4 (fixes for data origins)
        • HTCondor 9.0.8 with working proxy delegation (pending testing, available in osg-testing)
        • CVMFS 2.9.0 (pending testing, available in osg-testing)
        • oidc-agent 4.2.4 (requires a restart of the agent)
      • OSG 3.6-upcoming
        • HTCondor 9.4.0 (pending testing, available in osg-upcoming-testing)

      Miscellaneous

      • How's testing of XRootD in 3.6 going?
      • Site plans for EL8 vs EL9?
      • HTCondor-CE updates to support tokens
        • Known issue with C-style comments outside of routes in JOB_ROUTER_ENTRIES (thanks for the report, Wenjing!): https://opensciencegrid.org/docs/release/notes/#known-issues (see the configuration sketch after this list)
        • CEs on token-supporting versions of HTCondor-CE
          • gate01.aglt2.org
          • gate02.grid.umich.edu
          • gate04.aglt2.org
          • gridgk05.racf.bnl.gov
          • iut2-gk.mwt2.org
          • osg-gk.mwt2.org
        • CEs on old versions of HTCondor-CE
          • atlas-ce.bu.edu
          • bgk01.sdcc.bnl.gov
          • bgk02.sdcc.bnl.gov
          • gk01.atlas-swt2.org
          • gk04.swt2.uta.edu
          • grid1.oscer.ou.edu
          • gridgk01.racf.bnl.gov
          • gridgk02.racf.bnl.gov
          • gridgk03.racf.bnl.gov
          • gridgk04.racf.bnl.gov
          • gridgk06.racf.bnl.gov
          • gridgk07.racf.bnl.gov
          • gridgk08.racf.bnl.gov
          • mwt2-gk.campuscluster.illinois.edu
          • ouhep0.nhn.ou.edu
          • spce01.sdcc.bnl.gov
          • tier2-01.ochep.ou.edu
          • uct2-gk.mwt2.org
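
      A minimal sketch of the JOB_ROUTER_ENTRIES pattern covered by the known issue above (hypothetical file path, route name, and attributes; this is only an illustration, not any site's actual configuration):

          # Hypothetical HTCondor-CE snippet, e.g. in /etc/condor-ce/config.d/
          JOB_ROUTER_ENTRIES @=jre
              /* A C-style comment placed here, OUTSIDE the [ ... ] route ClassAd,
                 is the pattern reported to trigger the known issue. */
              [
                  name = "Route_Example";
                  TargetUniverse = 5;
              ]
          @jre
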
    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 30m
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      As will likely be noted by others, HammerCloud (HC) turned off BNL for several hours because of an expired user proxy.

      Puppetizing a stand-alone XRootD server to switch over from GridFTP for BNLHPC_DATADISK and BNLHPC_SCRATCHDISK.

      Starting the FY22 procurement process already; storage is going first.

      Commissioning of the new tape libraries is continuing. Initial throughput measurements (network, disk, and tape) have been made and look good in these synthetic tests.

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • The last week has been relatively rocky, as CERN had a couple of outages, though the week before was good.
      • Could each site say a little about how it will spend its remaining funding?
      • We need to finish the IPv6 rollout at NET2 and CPB.
      • What is the status of XRootD and TPC? It seems sites are still suffering occasional server hangs.
      • Caught an issue where the ATLAS and WLCG storage information was not being properly synchronized.
      • The remaining sites not at OSG 3.5 need to get onto it so we can make the move to 3.6 in the first quarter of next year.
      • MWT2 is working on upgrading HTCondor/HTCondor-CE while also juggling the move of the UC cluster to a new location.
      • Ofer will say something about upgrading services to get ready for Run 3.
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)


          dCache

            11/24/2021: dCache pool umfs06_12 caused jobs to fail at staging in files; restarting the dCache service resolved the problem.

            12/05/2021: dCache pool umfs23_1 caused jobs to fail at staging in files; an xfs_repair was needed to resolve the problem.

            11/30/2021: Updated dCache from 6.2.32 to 6.2.35 to fix the SRR report issue. The update was smooth.
                        We also updated the firmware and kernel and rebooted. The R740xd2 had a new BIOS installed (2.12.2).

          Condor

            12/02/2021: We spotted jobs nearly flooding one worker node that has small disk per core (14 GB),
                        so we changed the max disk from 15 GB to 13 GB/core for the AGLT2 PanDA queue;
                        this will stop reconstruction jobs from coming in.
                        This is likely caused by a bug in HTCondor 9.0.6 (it schedules jobs to worker nodes with insufficient disk space).
                        ADC also mentioned they could work on reducing the intermediate file sizes of the reconstruction jobs.

            12/06/2021: Did a rolling upgrade of HTCondor from 9.0.6 to 9.0.8 to address a bug
                        (Condor sends jobs to worker nodes with insufficient disk space).
                        The upgrade went smoothly. We first did the worker nodes without draining,
                        which requires setting a longer SHUTDOWN_GRACEFUL_TIMEOUT (3 days)
                        so that all remaining jobs can finish before condor restarts the startd after the upgrade;
                        the condor_master itself does not get restarted.
                        Then we did the sched nodes and head nodes and restarted the condor service after upgrading.
                        A configuration sketch is given below.
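
            A minimal sketch of the knob mentioned above (the 259200-second value is an assumption derived from the quoted 3 days; the actual AGLT2 configuration was not shown in the report):

                # Hypothetical worker-node HTCondor configuration fragment (illustration only).
                # 3 days = 3 * 24 * 3600 = 259200 seconds; the default graceful-shutdown window
                # is much shorter, so long-running jobs would otherwise be interrupted on restart.
                SHUTDOWN_GRACEFUL_TIMEOUT = 259200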

          Network

            12/04/2021: From Sat 12/04 1 PM to Sun 12/05 3:30 AM (13 hours) we lost the hard link between the UM and MSU sites.
                        This was due to a hardware issue in the Merit service provider's equipment;
                        a DWDM (dense wavelength-division multiplexing) optical card in East Lansing was replaced.
                        During the outage the MSU site lost its path to non-ESnet routes, including the Merit DNS resolvers,
                        but it now has ACL access to the MSU DNS resolvers.

          Hardware

            MSU & UM are working on common quotes for R740xd2 and R6525 servers with currently available AMD CPUs,
            planning for roughly a 50/50 storage/compute split.

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        UC:

        • A brief power outage last week brought the site down for about half a day while we recovered.
        • Temperature issues at the new data center took down one of our storage nodes there; it was quickly recovered.
        • First physical machine move was Monday 12/6. Working on bringing the moved machines back into production.
        • Dell is working on a new quote with a different CPU to get us our compute faster. All the other servers in the Dell PO have arrived.
        • Also discussing with Dell a storage purchase with our remaining funds.
        • Reverted SRR back to the old Ruby space script due to empty storage shares. Will work on it next week.

        IU:

        • Dell servers arrived. They are racked and in the process of being built.
        • Remaining funds will go toward compute.

        UIUC (for the newly installed 24 compute nodes):

        • Benchmarking ETA: this week
        • Production ETA: the next week or two
        • Remaining funds will go toward compute
      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef

         

        Preparing to retire the 3 TB pool (700 TB) at roughly the end of the year.

        Planning updates/purchases/preparation for Run 3, including IPv6, networking upgrades to NESE, additional NESE Ceph storage and gateways, and the UMass expansion.

        NESE Tape is nearing operations.

      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Overall, running well.

        - Still seeing occasional XRootD hangups; no longer on the proxy gateway since we upgraded to the latest 5.3.4-rc2 version, but still on the storage nodes, which are still running 4.12.1. An XRootD restart on the storage node usually fixes that.

        - Will upgrade the backend storage to 5.3.x as soon as a stable 5.3.x version is tested and available in the osg-upcoming repo.

        UTA:

        • IPv6 testing is progressing. We are installing the PS nodes today and will verify routing and run tests within the mesh of contacts. If all works well, we'll implement IPv6 on the front-end nodes of the cluster.
        • UTA_SWT2 had a problem saturating the inbound pipe with input data from SWT2_CPB. We have reduced the MAX I/O parameter in CRIC to get an easier job mix that can be supported.
        • The logistics of moving UTA_SWT2 assets have gotten complicated. We will know more after Jan 1.
        • We are still seeing issues with the WebDAV door. We have rebuilt the existing GridFTP servers to include WebDAV access and will move this into production shortly after adding it to CRIC. The intention for now is to move read/write operations to the new service while leaving deletions on the existing WebDAV host. After DNS changes and a new certificate, we'll run all three hosts under the single name gridftp.atlas-swt2.org.
    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
      • TACC
        • Offline for 'Texascale days' until Dec 11
          • Only the largest jobs can run during this period, from at least half the system size up to the full system size
          • It would be fun to apply for the next one!
        • Down to 85,000 SUs (22% of the allocation); on track to use them by the end of March
      • NERSC
        • Cori running fine
        • 15M hours remain; these need to be used in the next ~30 days
        • Working on Perlmutter; mcprod has assigned a task for us to play with
      • Looking into ways to set up more alarming/alerting
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        24-hour power outage today. Users have been warned.

        The GPU order went to the vendor.

      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        The UC AF is working fine.

        Successfully ran analysis on the UC AF while using ServiceX deployed on NCSA Fabric nodes.

    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • First pass at site tasks for Run 3 (see below)
      • F-S DevOps
        • SLATE node kernel updated at BU; Augustine and Saul added to FedOps email list
        • MWT2 squid moved (generated GGUS ticket)
        • Patrick/Horst added to atlas-squid group
        • Automated email deployment held up due to a MailGun/GitHub issue
      • ATLAS-WLCG CRIC syncing problem pointed out by Fred; fixed
      • Status of SWT2 decommissioning?

       

      Site Tasks in Preparation for Run 3

      Top Priority
          •    Confirm that SRR is working reliably and remove any SRM fallback for TPC
          •    Update site CEs to HTCondor-CE 5.1.1 or higher (Done: AGLT2, In process: BNL)
          •    Update site batch to HTCondor 9 (Done: AGLT2, BNL)
          •    Update sites to OSG 3.5 (NET2, SWT2)
          •    Site support for IPv6 (NET2, SWT2)
          •    Switch SWT2 from LSM to Rucio Mover

      Next Highest Priority
          •    Update sites to OSG 3.6
          •    Update to dCache 7 series (Done: BNL)
          •    Update SGE (NET2) and Slurm (SWT2)?

      Link to Facility Services Spreadsheet

      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
        • XCaches in SLATE
          • Two MWT2 XCaches are down: one is in transit, and the other one is behaving strangely and will be inspected today.
          • Everything else works fine.
        • VP
          • Working fine.
          • BNL VP queue still has no resources behind it.
        • ServiceX
          • Working fine.
          • Many improvements have been developed and are waiting to be put into production.
        • Rucio
          • VP integration ongoing (slowly).
      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        While waiting for the decommissioned UTA_SWT2 hardware, which will be used for a Kubernetes cluster at CPB, we are working on a faster option: starting with a smaller set of machines before the main chunk of UTA_SWT2 hardware arrives.

    • 14:55 15:05
      AOB 10m