US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 10:00 10:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
    • 10:10 10:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
    • 10:20 10:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Files lost from Dec 6-10 dCache database incident: continue watching; quiet now

        Transfers: 13-Mar GGUS: 2775/682642 Transfers failing with "file not found"
            13-Mar: scan of 10k errors found 892 unique files; 856 declared bad/lost that day
            14-Mar 9 more files
            16-Mar 1 more file
            None since.
            Ticket not closed.  Added discussion of suspected lost logfiles.

        Job errors: None since 13-Mar
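
        A minimal sketch of how a scan like the 13-Mar one could reduce many transfer errors to a list of unique files. This is not the actual AGLT2 procedure: the log line format, the regular expression, the sample paths, and the function name unique_missing_files are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical error-line format; real transfer logs will differ.
ERROR_RE = re.compile(r"file not found.*?(?P<path>/\S+)", re.IGNORECASE)


def unique_missing_files(lines):
    """Count 'file not found' errors per file path."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts


# Made-up sample lines, only to exercise the function.
sample = [
    "13-Mar 08:01:02 ERROR file not found: /pnfs/example/data/file_a.root",
    "13-Mar 08:01:09 ERROR file not found: /pnfs/example/data/file_a.root",
    "13-Mar 08:02:15 ERROR file not found: /pnfs/example/data/file_b.root",
    "13-Mar 08:03:00 INFO transfer finished: /pnfs/example/data/file_c.root",
]

counts = unique_missing_files(sample)
print(len(counts), "unique files in", sum(counts.values()), "errors")
```

        Each unique path would still need to be checked against storage before being declared bad/lost, since repeated errors for one file do not by themselves prove the file is gone.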

      Found a few worker nodes with CVMFS issues; some required a reboot to fix.

      UM site progress on EL9: migrated more services to EL9, including AFS servers, SVN, MariaDB, etc., and started migrating the firewall to nftables.

      EL9 at MSU Status
        Campus firewall issue finally resolved 12-Mar
        Found local firewall issue on Satellite side 13-Mar
        Found Apache setup issue on Capsule side 14-Mar
        Capsule still fails when proxying to Satellite the provisioning node's request for its kickstart file
        Found error log on Capsule. Not very specific.
        Verified we use a fresh token.

        Current lead (today): yesterday we spotted an error message in the Satellite GUI related to kickstart and snippet.
        Will ask Red Hat for help if we (AGLT2+MSUIT) can't resolve this soon.

    • 10:30 10:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • New UC storage is now online
      • UC and UIUC workers upgraded to Condor 24.0.6. IU and gatekeepers will be upgraded soon 
      • We are planning to discuss our procurement plans soon
    • 10:40 10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      - A few very brief blacklistings due to work on our storage and the many transfer requests coming from the cluster. Investigation of the transfer errors is ongoing.

    • 10:50 11:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB: 

      • EL9 Migration Updates

        • Site has been running smoothly without any interruptions (caused by EL9 migration) for over a month. 

        • We have two storage servers in our test cluster and three in production. 

          • We are testing new storage builds as EL7 in our test cluster, as a precaution before building an EL7 storage server in the production cluster.

          • Anticipate having additional new storage in service in the production cluster soon. 

        • We received the new rails from a third-party vendor to test. Will decide soon whether to purchase additional rails.

        • Continuing to develop the rest of the test cluster for testing new appliance builds. 

        • Continuing to make minor improvements to the EL9 production cluster and discussing ideas.

      • Network

        • Discussing and planning for future improvements to network infrastructure. 

        • Discussed potential purchases of new network equipment and servers with a Dell sales representative.

        • GGUS Tickets

          • 162991 (Network Monitoring)

            • Followed up with campus networking; they discussed this, but we have not heard back yet. I received a follow-up from another member of campus networking and responded, but have not heard anything since. Will follow up again soon.

          • 682412 (GoeGrid)

            • Still waiting for a status update from ESnet.

            • Will follow up with them soon. 

      OU:

      • Running well, no site issues I'm aware of
      • Still trying to figure out the ANALY_OU_OSCER_GPU_TEST failures in the new SLURM GPU test partition; have a lead.