US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
    • 11:10 11:20
      AGLT2 10m
      Speakers: Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Generally smooth running.
      Fewer issues with cvmfs.

      27-Mar site downtime for 2nd attempt to replace the UPS breaker at UM.
      The wrong part was dispatched. The complication comes from our model
      versus the current model. Getting the College (UPS owner) involved.
      We had wound our site down and used the opportunity to apply
      firmware and kernel updates to all worker and dCache nodes.
      Investigating an apparent hardware problem with 1 of the 3 VMware nodes at UM.

      Milestone efforts:

      Linux9: Continuing progress on using Red Hat Satellite for provisioning.
      We now have a method for provisioning directly from the connection used for the bond.
      Making progress on using Ansible for configuration management using
      the ansible-pull method.

      WLCG SOC operational: Had a deadline of 31-Mar.
      We are not quite there, but close to deployment at UM and MSU.
      Pre-requisite migration to OpenSearch completed. Capture nodes
      are deployed and provisioned. Traffic taps should be ready.
      We will understand our status better in the next couple of days
      and update the milestone.

    • 11:20 11:30
      MWT2 10m
      Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US))
      • All the IU and UC workers have been rebuilt as EL9.
      • We are currently working on rebuilding storage and management servers.
      • UIUC April PM was canceled.
      • IU switch migration and network changes caused issues between IU and CERN and took down the IU squid on March 14th. It was fixed by removing a static route from the IU enterprise network.
      • MWT2 squid was degraded on March 16th for a couple of hours. This was due to a network configuration issue on one of our servers. In the meantime the backup server picked up the load.
    • 11:30 11:40
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Storage has been back in production since last Thursday.

      The cluster has been back online since last night (Thursday). Apart from some old machines bailing out under load, the cluster is stable.

    • 11:40 11:50
      SWT2 10m
      Speakers: Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB -

      • UPS refresh was (finally) performed, resulting in better redundancy
      • Replaced the admin node for the cluster with better hardware during the UPS downtime
      • Replaced WAN switch in the data center that had reached end of life
      • Discovered a subtle side effect of having the old and new admin nodes up in parallel; fixing it resolved a bad data transfer / deletion issue
      • Testing LSM replacement
      • Currently setting up and testing tokens for the storage DTNs; completed setup of a test DTN for testing before implementation in the Tier 2
      • Troubleshooting problematic WNs