US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09


    • 1:00 PM 1:10 PM
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)) , Dr Shawn Mc Kee (University of Michigan (US))
    • 1:10 PM 1:20 PM
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin) , Matyas Selmeci

      Release (tomorrow)

      • HTCondor 9.0.2, BLAHP 2.1.0 (3.5 upcoming, 3.6)
      • XRootD 5.3.0 (3.5 upcoming)
      • VOMS client to support requesting VOMS proxies from IAM
      • XCache 2.0.1 (3.5 upcoming)

      Miscellaneous

    • 1:20 PM 1:35 PM
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 1:20 PM
        TBD 10m
    • 1:35 PM 1:40 PM
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)) , Eric Christian Lancon (Brookhaven National Laboratory (US))
      • Currently adding more space from the DATALAKE to the ATLAS DATA-Tape and MC-Tape dCache staging pools. This should reduce the churn that we are seeing (i.e., files copied from tape to staging disk and then removed before ATLAS copies them away).

      Reminder: HPSS (tape system) downtime from 2-Aug-21 through 7:00 PM on 5-Aug-21

    • 1:40 PM 2:00 PM
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Some issues in the last couple of weeks
        • MWT2: Upgrades to the main enterprise network switches at UC this Wednesday and last Wednesday. Last week's change caused IPv6 issues.
        • AGLT2: IPv6 issues and a full work area caused problems.
        • MSU: Moving to new location today.
        • Illinois: Today is the quarterly preventive maintenance period.
      • Get your reporting in today!!!
      • 1:40 PM
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)) , Dr Shawn Mc Kee (University of Michigan (US)) , Prof. Wenjing Wu (University of Michigan)

        1) MSU site is moving 65 WNs to the new DC, i.e. all the newer WNs (R620s, R630s, C6420s).

        2) UM site is working on the IPv6 issues on the new network. There are two sets of causes: we solved the first by adding static IPv6 ND mappings to the gateway, and are still working on the second, which affects the R620s whose data cables are connected to the management switches.

        3) Job failures: 40% failure rate on 20th July due to two errors. The first, "payload metadata does not exist", disappeared on 21st July (AGLT2 has the largest number of failed jobs with this error within USATLAS, but some other sites have similar errors). The second was a "no local space" error: the home directories of the usatlas users had filled up after years of accumulating small files. We cleaned up the space and set up a cron job to keep it clean.
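The cron-driven cleanup mentioned in item 3 could look roughly like the sketch below. It is only an illustration, not the script AGLT2 deployed: the function names, the age threshold, and the dry-run default are all assumptions. It walks a directory tree and removes regular files whose modification time is older than a given number of days.

```python
import os
import stat
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return regular files under root not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()  # lstat so symlinks are never treated as files
            except FileNotFoundError:
                continue  # file vanished while we were scanning
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                stale.append(path)
    return stale


def clean(root, max_age_days, dry_run=True):
    """Delete stale files under root; with dry_run=True only report them."""
    selected = []
    for path in find_stale_files(root, max_age_days):
        if not dry_run:
            try:
                path.unlink()
            except FileNotFoundError:
                continue  # already removed by someone else
        selected.append(path)
    return selected
```

Running it first with `dry_run=True` shows what would be deleted before a cron entry is allowed to remove anything.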

         

        Details about 2):

        More worker nodes are having IPv6 connectivity issues (they cannot reach the gateway). There are two sets of causes. The first is possibly a bug in either the Juniper or the Cisco border switches. The workaround is to add static IPv6 ND mappings on the Juniper gateway (we have added all worker nodes). Hopefully this will be resolved when we can retire the Juniper gateway (using Cisco instead) in August. The second is that the management switches (S3048) have IPv6 issues. We have ~20 R620s that need to connect to the management switches for data connections. We haven't found a solution for that yet, so we have retired Condor on all R620 worker nodes for now.

      • 1:45 PM
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)) , Jess Haney (Univ. Illinois at Urbana Champaign (US)) , Judith Lorraine Stephen (University of Chicago (US))

        UC:

        • North border router was upgraded last week (7/14) and the south border router is being upgraded this week (7/21)
        • After the north border router was swapped out, routing moved to the south which led to IPv6 issues over the weekend and early this week. UC network engineers worked on a fix, but ultimately moved IPv6 routing through the north border router temporarily.
        • GGUS  #153052 associated with the IPv6 issue (transfer issues)
        • 0% transfer efficiency with NERSC-PDSF
        • Relocation equipment trickling in.

        IU:

        • New management nodes up and running.
        • Working on getting new perfSONAR machines set up

        UIUC:

        • SLATE node arrived. Needs to be built and configured.
        • Quarterly PM today (7/21)
      • 1:50 PM
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))


        GGUS tickets: 0
        HC blacklists: 0

        o Smooth operations
        o Site full except for a dip around 2021-07-15 (unknown if it's a widespread dip)

        o Advanced stages of getting ready to buy worker nodes. 
        o XRootD 5.3.0 installed and working in our custom container 
        o Successfully exporting NET2_DATADISK, _SCRATCHDISK, _LOCALGROUPDISK 
        o Endpoint atlas-xrootd.bu.edu registered in CRIC 
        o Configured for HTTP-TPC and custom adler32; both work successfully
        o Getting put into "smoke tests" by Alessandra & co. 
        o Some problems remain, possibly related to transfers to dCache sites; Wei and Andy are investigating.
        o NESE Tape ATLAS endpoints have arrived; we expect them to be racked and cabled this week.
        o perfSONAR node rebuilt with new hardware; both nodes are on IPv6 now.

        o Annual MGHPCC power maintenance, August 9
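For reference on the adler32 item above: the notes do not describe how the custom checksum is wired into the NET2 XRootD setup, so the following is only a generic sketch of how an Adler-32 value can be computed for a file with Python's standard `zlib` module, for example to cross-check a transferred file by hand. The function name `adler32_hex` and the chunk size are illustrative assumptions; the zero-padded 8-digit hex output is a common way to present the value.

```python
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in pieces.
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"
```

Reading in chunks keeps memory use flat regardless of file size, which matters for multi-gigabyte data files.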

      • 1:55 PM
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)) , Mark Sosebee (University of Texas at Arlington (US)) , Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        Setting up a test host, acting as a proxy, with a test version of XRootD 5.3 from OSG. Software is installed; working on configuration.

        Operations mostly smooth over the period

         

        OU:

        - Smooth operations, ran low on jobs occasionally.

        - XRootD 5.3.0 installed, HTTP-TPC working, waiting to be included in smoke tests.

         

    • 2:00 PM 2:05 PM
      WBS 2.3.3 HPC Operations 5m
      Speaker: Lincoln Bryant (University of Chicago (US))

      Overall, some small issues with failed jobs due to forgetting to renew credentials over the weekend.

      Frontera has been running 100-node jobs for some weeks now, and throughput is more consistent. ~211,000 SUs (56% of allocation) remaining.

      Cori reduced significantly, only a single job queued at a time.

    • 2:05 PM 2:20 PM
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:05 PM
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        NTR

      • 2:10 PM
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:15 PM
        Analysis Facilities - Chicago 5m
        Speakers: David Jordan (University of Chicago (US)) , Ilija Vukotic (University of Chicago (US))

        A few more compute nodes showed up and have been racked, built, etc. and added to the cluster.

        A couple more interactive machines showed up, but haven't been racked and built yet. These (along with the three machines mentioned above) aren't necessary for us to go into production.

        Still waiting on the GPU machine. We believe it will arrive sometime in November (according to Dell).

        We've gotten a Condor queue up and running. Jobs can be submitted from both of the submit hosts we're planning to have for users on day 1.

    • 2:20 PM 2:40 PM
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Created FedOps team email list and documentation page, waiting on RT queue
      • XRootD 5.3.0 release should now be deployed at sites; it needs to be added to the DOMA smoke tests
        • BNL XCache updated and operational again (lost one NVMe drive); will wait for Ilija before activating VP queue
      • Mark working with Saul on topology clean-up for NET2
      • Working on Quarterly Report
    • 2:40 PM 2:45 PM
      AOB 5m