US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09


    • 13:00 → 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Hope everyone has recovered from our intense face-to-face meeting at SLAC last week!

      We have quite a few areas to work on based on our meeting.

      • The Tier-2 procurement plans should be read and commented on by everyone in the facility: https://drive.google.com/drive/folders/1rxwnJtNOxrfvBj7lCuiL3LZRItnJGaIB?usp=sharing. We need to finalize them by the end of the month.
      • We need to develop our mini-challenges and associated milestones for the upcoming DC24
        • The load testing framework needs to be retried and improved, along with the creation of quarterly load-testing activities.
        • We want to test VP at the largest scale possible. Perhaps one Tier-2 by the end of March 2023, then two by June 2023, and the whole facility by September 2023?
        • The work on differentiated storage and storage organization needs to be planned out
      • Everyone is encouraged to sign up for lightning talks at future meetings.

      Don't forget our R&D and F2F notes:

      F2F notes:  https://docs.google.com/document/d/1Cb7nLgJKlwy_3twM_PkIzGq7biX6q6YPFwsCdIq0VtU/edit

      Facility R&D notes:  https://docs.google.com/document/d/1uqHqygcY6Dcg1amPBZcIH0d1x-5kBNvaxjc99UxvlXU/edit 

      The WLCG DOMA meeting today discussed DC24 (https://indico.cern.ch/event/1226223/). WLCG DOMA will be responsible for gathering information about mini-challenges and milestones (not imposing or defining them). Interested groups (like US ATLAS) need to contribute.

      Last item:  we need to add our demonstrators in advance of the next ATLAS S&C week (January 23, 2023).  See links at top of Facility R&D notes for more details.

    • 13:10 → 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Packages Ready for Testing


      OSG 3.6
      - Major Components [1]
        - XRootD 5.5.1
          - Fixes a critical issue with XRootD FUSE mounts via xrdfs
        - CVMFS 2.10.0
          - https://cvmfs.readthedocs.io/en/2.10/cpt-releasenotes.html
        - HTCondor 10.0.0 [2]
          - Please consult the upgrade guide for the major differences:
            https://htcondor.readthedocs.io/en/latest/version-history/upgrading-from-9-0-to-10-0-versions.html
          - In particular, consult the second section, on things to be aware of
            when upgrading (a rough sketch of the upgrade flow follows below)

      [2] HTCondor 10.0.0 will not be released until we have a few successful upgrades.
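
      As a rough sketch only of what that upgrade flow looks like on a single host (the package name and repo setup are assumptions for a typical OSG 3.6 EL8 install; the upgrade guide above is authoritative):

        condor_version                # record the running 9.0.x version first
        systemctl stop condor         # stop the daemons before upgrading
        dnf update condor             # pulls in 10.0.0 once it reaches the release repo
        systemctl start condor
        condor_version                # confirm the host now reports 10.0.x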

      HTCondor

      • The November 2022 release of 10.0.0 means that the 9.0 EOL is pushed back to May 2023
      • HTCondor 10.2.0 is an EL9 preview release: it will have EL9 support but NOT cgroup support. EL9 cgroup support is targeted for 10.3.0
    • 13:20 → 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:50 → 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
    • 13:55 → 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Good running recently with few problems.
        • Legacy NET2 storage stability improved by Augustine (BU). This storage will transfer over to NET2.1.
        • MWT2 IU had a serious enterprise-level switch failure that knocked the IU part of MWT2 offline for four days.
        • A database upgrade at CERN exposed a bug in a PanDA daemon, turning off most sites for about half a day.
      • We will discuss the procurement plans (made available as the SLAC meeting started) at next week's 2.3 Management meeting.
      • Also discussed Tier 2 milestones at SLAC meeting.
        • Procurement
        • Next OS
        • Local site
        • Discussed a proposal to create operations plans for each Tier 2 and a global operations plan.
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen

        Incidents:

        Three pools on msufs12 and msufs13 went offline, causing file stage-in errors in jobs. Restarting the three pools fixed the problem.

        Procurement: 

        In contact with Dell to get quotes for the computing and storage nodes.

        Software:

        Next OS and provisioning: Deployed CentOS 8 Stream with Cobbler for the computing and storage nodes, configured by CFEngine. We are testing different software deployment and configuration options on CentOS 8 Stream (a rough sketch of the Cobbler side is below).

        Verified that VMware 7 can use the not-officially-supported MD3460 storage (Dell-customized VMware ISOs are still available on the VMware site). This will help the MSU transition to TrueNAS: since TrueNAS won't be our only storage, upgrading to VMware 7 is not contingent on ordering/receiving a second TrueNAS storage node. iSCSI connectivity is still waiting on configuration.
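
        As an illustration only, the Cobbler side of such a setup looks roughly like the following. The profile and system names are hypothetical, and CFEngine takes over configuration after the install completes:

          # List the distros/profiles already imported into Cobbler.
          cobbler profile list

          # Register a (hypothetical) node against a CentOS 8 Stream profile,
          # then push the changes out to the PXE/DHCP infrastructure.
          cobbler system add --name=wn-example --profile=centos8stream-x86_64 \
              --mac=00:11:22:33:44:55
          cobbler sync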

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
        • UC had a storage node go down for a handful of hours on the morning of 12/5. MWT2 ran through it. Otherwise UC had smooth running.
        • An IU WAN switch failure took IU offline 11/29-12/2.
          • The site came back online on 12/2 at half its bandwidth capacity.
          • Some trouble with CRIC accepting downtime updates from topology: only one of the HTCondor central managers picked up the extended downtime, so the squid and the second central manager came back online after the initial downtime window ended.
        • UIUC drained when the IU switch went down. Once we set the IU gatekeepers offline, jobs filled back up.
      • 14:05
        NET2 5m
        Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Antonio Lopez, William Axel Leight (University of Massachusetts Amherst)
        • We heard back from BU this Monday, and our team has clearance to begin the process of moving the hardware. We hope to have some equipment removal started by the end of next week.
        • On the network front, our IT team is configuring the internal network; the connection to ESnet comes right after that.
      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        • Site running well
        • A storage issue earlier this week turned out to be a bad network cable on one of the XRootD storage servers; the cable has been replaced.

        UTA:

        • No production issues
        • Working on network upgrade
        • Closed GGUS ticket 155845, concerning the WLCG netsite at UTA.

    • 14:15 → 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
      • Focus for the next few weeks is on completing the allocation at NERSC. We have 100K hours of our original allocation remaining, plus we were given another 70K on top of that (170K total remaining).
        • Now running Cori jobs in the Premium queue (2x charge factor).
        • We are running as many jobs as we can at Perlmutter (40K-80K cores), although there is a high rate of failed jobs (30% or so) that needs further investigation.
      • Rui is working on GPU jobs at NERSC; Tadashi gave us a configuration hint for getting analysis/prun jobs going there.
      • We started investigating VP at NERSC: running XCache on the login nodes (Lisa's recommendation, as PSCRATCH is not available on the DTNs), plus heartbeats and monitoring. Everything is started manually for now (no containers, just shell scripts; see the first sketch below). TBD whether this will scale.
      • We now have a TACC allocation of 330K hours and 3K GPU hours. Building a new, clean configuration at TACC. Investigating Singularity+cvmfsexec to bring our own CVMFS and run any job (see the second sketch below).
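
      First sketch: the general shape of the manual XCache start-up plus heartbeat described above. The config path, log location, and monitoring endpoint are hypothetical stand-ins, not the actual NERSC setup:

        #!/bin/bash
        # Start XCache (an xrootd daemon with a caching config) on a login node.
        # -b backgrounds the daemon, -l sets the log file, -c points at the config.
        xrootd -b -l "$HOME/xcache/xrootd.log" -c "$HOME/xcache/xcache.cfg"

        # Crude heartbeat loop reporting liveness to a (hypothetical) monitoring endpoint.
        while true; do
            curl -fsS "https://monitoring.example.org/xcache/heartbeat?host=$(hostname)" || true
            sleep 300
        done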
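
      Second sketch: the general Singularity+cvmfsexec pattern under investigation for TACC (see https://github.com/cvmfs/cvmfsexec; the repository list and image path are placeholders):

        # One-time, unprivileged setup: fetch cvmfsexec and build a distribution.
        git clone https://github.com/cvmfs/cvmfsexec
        cd cvmfsexec && ./makedist default

        # Mount the ATLAS repositories without root, then run a payload inside a
        # Singularity container that sees the resulting /cvmfs.
        ./cvmfsexec atlas.cern.ch atlas-condb.cern.ch -- \
            singularity exec --bind /cvmfs /path/to/image.sif ls /cvmfs/atlas.cern.ch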

    • 14:20 → 14:35
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • Added batch queue status monitoring to the portal
          • Shows the number of running jobs, used cores, and pending and held jobs.
          • Lately we have observed memory being exhausted before cores. The Prometheus monitoring (not currently available in the portal) indicates that in most cases actual memory usage is much less than the requested memory. Some user engagement is needed, and we will also add monitoring improvements to help users understand the status better (see the sketch after this list).
        • Hardware issue
          • A node (c010) had a fan issue; reseating the fan cage and cables solved the problem.
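
        A sketch of the kind of comparison that Prometheus data supports, assuming the usual cAdvisor and kube-state-metrics metric names (the server URL is a placeholder):

          # Per-pod ratio of actual working-set memory to requested memory;
          # values well below 1 mean the user over-requested memory.
          curl -sG 'http://prometheus.example.org/api/v1/query' --data-urlencode \
            'query=sum by (pod) (container_memory_working_set_bytes)
                   / sum by (pod) (kube_pod_container_resource_requests{resource="memory"})'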
    • 14:35 → 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • WLCG DOMA general meeting earlier today:  https://indico.cern.ch/event/1226223/
        • Targets and milestones for DC24 still under discussion
        • Scope and schedules for mini-challenges will be left to the sites/regions but need planning/coordination
        • Expanding effort on testing/deployment of tape REST API
      • Ongoing investigation of issues with HC tests for VP queues.
        • Manual override on queues temporarily while this is being addressed
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        The SLATE team is working out issues with the new release and expects to have the problems resolved in about a week. We will then want to do a round of k8s edge cluster updates to 1.24. Following the F2F meeting, we'll revisit federated ops for XCache, Frontier-Squid, and an R&D activity with Varnish.

      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        No news since the report last week.

    • 14:55 → 15:05
      AOB 10m