US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      We have an early deadline for quarterly reports because of the review at the end of the month. Reports are due by Friday, January 14, 2022 (a week from this coming Friday). To allow Rob and Shawn to complete the WBS 2.3 version, we need the level 3 (WBS 2.3.x) reports done by Wednesday, January 12, 2022. Please try to get these completed ASAP.

       

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Several packages ready for testing:


      3.5.53-upcoming and 3.6:
          - HTCondor-CE 5.1.3 (various bugfixes, see https://opensciencegrid.atlassian.net/browse/SOFTWARE-4951)
          - XRootD 5.4.0 (new features and bug fixes, see https://github.com/xrootd/xrootd/releases/tag/v5.4.0)
      3.6 only:
          - oidc-agent 4.2.4 (new major version, see https://github.com/indigo-dc/oidc-agent/releases for changes since 3.3.3)
          - cvmfs 2.9.0
       

    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 30m
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      CPU servers are being moved from the old data center to the new data center today. They are expected to come online later this week.

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • No big crises during the holiday break.
        • All sites had some problems.
        • There was a major issue at CERN that disrupted the monitoring, but the missing data is now available.
      • The quarterly reporting is due early this year. I want your input by the end of the day next Tuesday. I listed 4 specific items that I want each site to address in its report (for some items, some sites will simply report that they are already in the final configuration for the start of Run 3):
        • Updating OSG and Condor versions.
        • Updating storage version.
        • Updates to the queuing system.
        • IPv6.
      • Seems like XRootD may be making progress toward stable HTTP-TPC transfers.
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        - Updated ELK for the log4j issues and applied other security updates.
        - MSU received a SLATE node, which is being installed with Alma Linux (per SLATE team request).
        - MSU received a network capture node, which will also run Alma Linux (for the Milan CPU).
        - MSU received a VMware storage node.
        - Purchase plan: R740xd2 with 18 TB drives and R6525 with AMD 7452 (128 HT/node); the final count will be determined after final quotes. Roughly $500k total, split roughly 50/50 between storage and compute.

        - Rebooting the Cisco border switches caused IPv4 issues among various machines at the UM site, which led to cvmfs failures and squid server failover. It took a couple of days of debugging (with UM ITS and Cisco support) to fix the issue.

        - Another SLATE squid issue: the squids did not show traffic on the CERN squid monitoring. We had to rejoin the nodes to the k8s cluster to fix it.

        - A patch was applied to the Cisco border switches, which fixed the IPv6 forwarding issues (to the Dell management switches), so we were able to bring all the R620 worker nodes whose data connections go through the Dell management switches back into condor.

        - Merit Networks has had another issue on MiLR (our network that connects us to Chicago and East Lansing). This has broken our default route and access to and from AGLT2 for networks other than research and education (R&E) networks.

        - A typo in a routing rules change caused IPv6 ping failures to all CERN machines, and many jobs failed on rucio timeouts. It was fixed the next morning.

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        Upgraded uct2-gk to HTCondor-CE 5.1.2 and HTCondor 9.0.8 this morning.

        One of the UC dCache nodes went offline on December 26th. Its pools were brought back up that day.

        The second set of dCache transfers finished for the UC server room relocation. The next move is scheduled for January 24th.

        New IU and UIUC compute nodes are online. The revised UC order has been submitted; we are still waiting on an estimated shipping date.

        Surplus UC servers arrived at IU. Fred is in the process of installing them.

        Discussing the upcoming purchase order. Fred is working on benchmarking and quotes.

      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef

         

        o We had staging issues over the break and had to limit the total number of jobs by hand.

        o Downtime on Tuesday, Jan 11, for:

          • Retiring 3TB pool (770TB)
          • NFS kernel upgrade
          • Preparations for new worker nodes

        o Adding 4 DTN nodes to increase the GPFS-worker bandwidth.  

        o About to place orders on a new NESE Ceph rack to add to NESE_DATADISK.  3.8PB raw, 12 new DTN nodes. 

        o NESE Tape working, coming online. 

        o Pressing Harvard on IPv6.

        o Plans for NET2 expansion with UMass, bare metal cluster, etc. nearing finalization.  

      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Nothing to report, all running well.

        UTA:

        Problems are still occurring with the WebDAV door. We are going to upgrade the version of XRootD and set up the existing gridftp servers to take the load of transfers.

        Over the break, we had one small downtime while the chilled water for the lab was being worked on. Fortunately the cooling was maintained and we were able to come back quickly.

    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))

      TACC

      • SLATE node has an issue - Kubernetes has broken itself. 
        • Working with TACC team to fix this
      • Harvester is broken as well, because it was using the SLATE node for its MySQL DB
        • A standard SQLite installation won't work at TACC for some reason. Something strange in the environment?

      NERSC

      • Allocation approved for Perlmutter: we have 500K CPU hours and 11K GPU hours, starting Jan 19th, for 1 year.
      • Cori is failing a large number of jobs. Logs indicate SLURM is cancelling the jobs after about 30 minutes.
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • NTR.  Compiling info for next week's GDB presentation.
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        Tier3 status update provided by Fengping Hu et al... sent to Alessandra:

         

    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • BNL HTCondor-CEs have been upgraded (thanks Xin!)
      • ANALY_BNL_VP queue issues were traced to a deactivated CE, followed by a problematic "flavour" value in CRIC, then a maxWallTime=0 pilot setting. Jobs seem to be running again as of this morning.
      • File transfers and staging apparently continued during the BNL tape service downtime on 12/29 13:00-17:00 UTC (link). Why?
      • MWT2 squid service was briefly degraded after a SLATE reconfiguration and a failed restart.
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        • Working fine. Restarted all SLATE instances to get some small changes in.
        • AGLT2 nodes needed intervention from Wenjing and Mohammad

        VP

        • working fine
        • BNL_VP is now getting jobs, but the jobs are failing. Xin and Ofer are looking at it. Not related to XCache.

        Analytics

        • ES running fine. Preparing the next batch of servers for transport
        • Updating all the Logstash instances; there are 4 running.
        • Updating Alarm & Alert frontend.

        ServiceX

        • stress testing
        • testing for graceful handling of errors

         

      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        While waiting for the decommissioned UTA_SWT2 hardware, which will be used for the Kubernetes cluster at CPB, we are working on a faster option: starting with fewer machines before the main chunk of UTA_SWT2 hardware arrives.

    • 14:55 15:05
      AOB 10m