US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID: 67453565657
Host: Fred Luehring
    • 10:00 → 10:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 10:10 → 10:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix Hung-Te Lee (Academia Sinica (TW))

      Networking: 

      1. The bandwidth connecting to LHCONE through ESnet will be upgraded to 5 Gbps in April 2025.
      2. Solutions for using the TW-US link, the TW-JP link (shared 10 Gbps), and the TW-SG-AMS link (shared 10 Gbps) for WLCG at the same time require further discussion.

      Compute: Will speed up bringing all job slots (2.2K) online after the migration from CentOS to AlmaLinux 9.

      Site: No particular problems with site functionality were found.

    • 10:20 → 10:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      EL9 at MSU
        all satellite/capsule configurations seem resolved
        all infrastructure/firewall issues seem resolved
        still working on node and build configurations
        initial test with R410 hit a problem (no install support for its disk controller)
        switched target to R630
        now builds to completion
        sorting out an issue: nodes are not self-registering via subscription-manager as part of the build
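
      A minimal sketch of the kind of post-build check that could catch and repair the
      registration gap above, assuming registration via an activation key (the org and
      key values are placeholders, not AGLT2's actual ones):

        #!/usr/bin/env python3
        """Verify a freshly built node registered itself via subscription-manager."""
        import subprocess
        import sys

        ORG = "example-org"          # placeholder organization ID
        ACTIVATION_KEY = "el9-node"  # placeholder activation key

        def is_registered() -> bool:
            # `subscription-manager identity` exits non-zero when unregistered.
            return subprocess.run(["subscription-manager", "identity"],
                                  capture_output=True).returncode == 0

        if not is_registered():
            print("node not registered; registering now", file=sys.stderr)
            subprocess.run(["subscription-manager", "register",
                            f"--org={ORG}", f"--activationkey={ACTIVATION_KEY}"],
                           check=True)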

      Fallout from the dCache issue last December has subsided
        last problem noticed was 1 lost file a week ago

      Using nftables on EL9
        ported the iptables configuration (managed via Cobbler) to nftables (managed via Ansible)
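
      A minimal sketch of how the ported ruleset could be sanity-checked after an
      Ansible run; the expected rule fragments are illustrative, not the actual
      AGLT2 ruleset:

        #!/usr/bin/env python3
        """Check that expected rules survived the iptables-to-nftables port."""
        import subprocess
        import sys

        # Illustrative fragments expected somewhere in the live ruleset.
        EXPECTED = ["tcp dport 22 accept",    # ssh
                    "tcp dport 1094 accept"]  # xrootd, as an example

        ruleset = subprocess.run(["nft", "list", "ruleset"],
                                 capture_output=True, text=True, check=True).stdout

        missing = [frag for frag in EXPECTED if frag not in ruleset]
        if missing:
            sys.exit(f"missing rules: {missing}")
        print("all expected rules present")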

      Question: do we need to do something now about the SAM test hiccup last week?
      We still see a 31-hour gap in monitoring.

    • 10:30 → 10:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

      Updated HTCondor to 24.0.6-1 for the security release
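
      One way to confirm the whole pool picked up the patched build, sketched with
      the HTCondor Python bindings (assumes the script runs where it can query the
      pool collector):

        #!/usr/bin/env python3
        """List startds not yet reporting the patched HTCondor version."""
        import htcondor

        PATCHED = "24.0.6"

        ads = htcondor.Collector().query(htcondor.AdTypes.Startd,
                                         projection=["Machine", "CondorVersion"])
        stale = sorted({ad["Machine"] for ad in ads
                        if PATCHED not in ad.get("CondorVersion", "")})

        print(f"{len(ads)} startd ads, {len(stale)} machines not yet on {PATCHED}")
        for machine in stale:
            print(" ", machine)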

      Updated ingress-nginx on our k8s clusters to address https://kubernetes.io/blog/2025/03/24/ingress-nginx-cve-2025-1974/
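
      A quick check, sketched with kubectl via Python, that a cluster's controller
      image is one of the patched releases for that CVE (namespace and deployment
      name are the upstream defaults and may differ per cluster):

        #!/usr/bin/env python3
        """Report the running ingress-nginx controller image tag."""
        import subprocess

        NAMESPACE = "ingress-nginx"              # upstream default namespace
        DEPLOYMENT = "ingress-nginx-controller"  # upstream default name
        PATCHED = ("v1.11.5", "v1.12.1")         # fixed releases for CVE-2025-1974

        image = subprocess.run(
            ["kubectl", "-n", NAMESPACE, "get", "deploy", DEPLOYMENT, "-o",
             "jsonpath={.spec.template.spec.containers[0].image}"],
            capture_output=True, text=True, check=True).stdout.strip()

        print("controller image:", image)
        if not any(tag in image for tag in PATCHED):
            print("WARNING: not a patched release", PATCHED)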

      Troubleshooting 'Stale file handle' container issues, possibly related to our CVMFS configuration on high-core-count workers
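
      A small probe of the kind that can help localize the stale handles on a worker
      (the repository list is illustrative; ESTALE detection confirms the symptom,
      not the CVMFS configuration cause):

        #!/usr/bin/env python3
        """Probe CVMFS mounts for stale file handles."""
        import errno
        import os

        REPOS = ["/cvmfs/atlas.cern.ch",        # illustrative repository list
                 "/cvmfs/atlas-condb.cern.ch"]

        for repo in REPOS:
            try:
                os.stat(repo)
                print(f"{repo}: ok")
            except OSError as exc:
                tag = "stale file handle" if exc.errno == errno.ESTALE else str(exc)
                print(f"{repo}: {tag}")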

      Updating CVMFS to 2.12.7

    • 10:40 → 10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      The site drained last weekend owing to the Harvester certificate expiring; it had to be renewed by hand. Eduardo and Fernando are working on automating this process for Kubernetes sites.
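
      A minimal sketch of the expiry check such automation could run, assuming the
      certificate is readable as a PEM file (the path and threshold are placeholders,
      and a real job would trigger renewal rather than just warn):

        #!/usr/bin/env python3
        """Warn when the Harvester certificate is close to expiry."""
        import datetime
        import sys

        from cryptography import x509  # pip install 'cryptography>=42'

        CERT_PATH = "/etc/harvester/hostcert.pem"  # placeholder location
        WARN_DAYS = 14

        with open(CERT_PATH, "rb") as fh:
            cert = x509.load_pem_x509_certificate(fh.read())

        left = cert.not_valid_after_utc - datetime.datetime.now(datetime.timezone.utc)
        if left < datetime.timedelta(days=WARN_DAYS):
            sys.exit(f"certificate expires in {left.days} days; renew now")
        print(f"certificate valid for another {left.days} days")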

    • 10:50 → 11:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB: 

      • Operations

        • No major site issues to mention. We have been running very smoothly.

        • We noticed a slight decrease in running job slots and an increase in wrapper faults on 3/30 (weekend), but this was brief and is now resolved.

        • Running jobs also appear to have dipped briefly last night (4/1). We will look into this.

        • Usual data center maintenance (replacing storage drives, addressing two problematic worker nodes, monitoring).

      • EL9 Migration Updates

        • Continuing to make slight improvements to our modules in both the test and production clusters. 

        • Continuing to develop and test modules for XRootD proxy and storage in the test cluster. 

      • New Storage

        • We physically installed additional new storage in racks.

        • Ran into a few issues while trying to deploy new storage. 

          • Tested new third-party rails but decided against using them. Contacted Dell for suggestions on solutions while we research and plan.

          • DHCP request issues interfered with Rocks' ability to provision with EL7; this has been resolved. The iDRAC devices were set to DHCP. We temporarily removed the module that manages this setting so it can be fixed and tested in the test cluster before being added back into production, and for now we manually set these devices to static addresses, which resolved the issue (see the sketch after this list).

          • TFTP server issues: provisioning nodes led to TFTP timeouts with Rocks. Investigated this and resolved it.

          • Configured and tested provisioning of the new storage in the test cluster. Rocks does not seem to support UEFI but does work with BIOS boot; however, setting the new storage nodes to BIOS boot mode causes the M.2 drive in the BOSS-N1 boot controller to not be detected. Working on a solution and continuing to test for now.

        • We plan on getting some of our new storage online and in service once we overcome these issues. 
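
      A sketch of the manual DHCP-to-static iDRAC change scripted over racadm; the
      addresses and credentials are placeholders, and the iDRAC.IPv4.* attribute
      names are Dell's standard ones:

        #!/usr/bin/env python3
        """Switch a batch of iDRACs from DHCP to static addressing."""
        import subprocess

        # Placeholder inventory: current iDRAC address -> desired static address.
        IDRACS = {"10.0.1.51": "10.0.2.51",
                  "10.0.1.52": "10.0.2.52"}
        NETMASK, GATEWAY = "255.255.255.0", "10.0.2.1"
        USER, PASSWORD = "root", "changeme"  # placeholders

        def racadm(host, *args):
            subprocess.run(["racadm", "-r", host, "-u", USER, "-p", PASSWORD, *args],
                           check=True)

        for current, static in IDRACS.items():
            racadm(current, "set", "iDRAC.IPv4.DHCPEnable", "0")
            racadm(current, "set", "iDRAC.IPv4.Netmask", NETMASK)
            racadm(current, "set", "iDRAC.IPv4.Gateway", GATEWAY)
            # Set the address last: the session drops once the IP changes.
            racadm(current, "set", "iDRAC.IPv4.Address", static)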


      OU:

      • Not much to report, running well; just some occasional storage overloads.
      • The EL9 /lscratch deletion bug seems to be fixed, so we will start migrating worker nodes from EL7 containers to EL9 bare metal, a few nodes at a time.