US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 - 11:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
      • DC24 has affected all sites detrimentally.
        • Two FTS outages in the last week.
        • Please discuss how DC24 has affected your site.
      • Could each site please discuss its plans and status for the EL9 migration.
      • CVMFS troubles?
    • 11:10 - 11:20
      AGLT2 10m
      Speakers: Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      No major problems.
      No problems with DC24 (so far).

      Starting 01/29/2024 we noticed a sharp increase in CVMFS failures on worker nodes.
      For a large fraction of the incidents, ‘cvmfs_config probe’ would hang while probing the OSG repositories.
      It often took one or two CVMFS restart (killall) attempts to recover a node.
      On several occasions a reboot was needed to recover. This was most pronounced in the first week
      and gradually quieted down over the next two weeks. No issues have been noticed recently.
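
      A minimal watchdog sketch of that recovery sequence, assuming the standard cvmfs_config and killall commands are available on the node; the repository list and timeout are illustrative, not what AGLT2 actually runs.

      #!/usr/bin/env python3
      # Sketch: probe CVMFS and, if the probe hangs or fails, try the same
      # killall-based restart that usually recovered our worker nodes.
      import subprocess

      REPOS = ["atlas.cern.ch", "oasis.opensciencegrid.org"]  # illustrative list
      PROBE_TIMEOUT = 120  # seconds; a hung probe was the usual symptom

      def probe_ok(repo):
          try:
              r = subprocess.run(["cvmfs_config", "probe", repo], timeout=PROBE_TIMEOUT,
                                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
              return r.returncode == 0
          except subprocess.TimeoutExpired:
              return False

      def restart_cvmfs():
          # "killall" restart: kill the cvmfs2 client processes; autofs-managed
          # mounts come back on next access. A second attempt (or a reboot)
          # was sometimes needed.
          subprocess.run(["killall", "cvmfs2"])

      if __name__ == "__main__":
          for repo in REPOS:
              if not probe_ok(repo):
                  print(f"probe of {repo} hung or failed; restarting CVMFS")
                  restart_cvmfs()
                  break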

      Two tickets about SLATE squid instance problems.
      The first ticket was for the sl-um-es2 instance hanging.
      The second ticket followed an update of the container for a security fix that may not have been successfully deployed.
      It appears the wrong image tag (testing) was used for sl-um-es3?

      dCache xrootd monitoring: configured, but no reports going to Kafka yet.
      Waiting until the end of DC24 to restart the doors.
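
      A minimal check, once the doors are restarted, that the xrootd transfer reports are actually reaching Kafka; the broker address and topic name are placeholders (not AGLT2 values) and would need to match the local dCache/Kafka configuration. Requires the kafka-python package.

      #!/usr/bin/env python3
      # Sketch: consume a few messages from the (assumed) dCache reporting topic
      # to confirm that records are flowing after the doors are restarted.
      from kafka import KafkaConsumer

      BROKERS = ["kafka.example.org:9092"]   # placeholder broker address
      TOPIC = "billing"                      # placeholder topic name

      consumer = KafkaConsumer(TOPIC,
                               bootstrap_servers=BROKERS,
                               auto_offset_reset="latest",
                               consumer_timeout_ms=60000)  # give up after 60 s of silence

      count = 0
      for msg in consumer:
          count += 1
          print(msg.value[:200])  # show the start of each record
          if count >= 5:
              break

      if count:
          print(f"received {count} record(s)")
      else:
          print("no records received; check the door and Kafka settings")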

      We usually pool our MSUT2, UMT2, and UMT3 purchases to maximize the discount,
      but the UM T3 has to spend its DOE money now.
      Compute selected: R6625 with 32-core AMD EPYC 9354 CPUs and 24x16 GB DIMMs (128 hardware threads, 3 GB/HT).
      Storage selected: R760xd2 with 24x20 TB drives, an Intel Xeon Silver 4510 CPU (2.4 GHz, 12 cores), and 8x16 GB DIMMs.
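
      For reference, the arithmetic behind the figures in parentheses, assuming a dual-socket R6625 (the 128 hardware threads and 24 DIMMs only work out for two 32-core CPUs):

      # Compute node (R6625) sizing, assuming 2 sockets of 32-core EPYC 9354 with SMT enabled
      sockets, cores_per_socket, threads_per_core = 2, 32, 2
      hw_threads = sockets * cores_per_socket * threads_per_core   # 128 hardware threads
      memory_gb = 24 * 16                                          # 384 GB total
      print(hw_threads, memory_gb / hw_threads)                    # 128, 3.0 GB per hardware thread

      # Storage node (R760xd2) raw capacity
      print(24 * 20, "TB raw per node")                            # 480 TB raw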

      EL9: both UM and MSU have Red Hat site licenses.
      UM has been able to provision RHEL9 nodes from Red Hat Satellite; MSU has not yet.
      Lots of work ahead.

    • 11:20 - 11:30
      MWT2 10m
      Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US))

      DC24 effects

      • We are seeing I/O overload on our MD3460 storage nodes during the data challenge
      • Pools have been reduced to 100 movers per pool

       

      Rebuilding existing UC and IU workers and storage as AlmaLinux 9

      • A missing openldap-compat package on one of the rebuilt EL9 workers was causing job errors (a quick check is sketched below)
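
      A quick verification sketch, run on a rebuilt worker, to catch this kind of missing package before jobs fail; only the package named above is listed.

      #!/usr/bin/env python3
      # Sketch: verify that packages jobs depend on are installed on a rebuilt EL9 worker.
      import subprocess

      PACKAGES = ["openldap-compat"]  # the package that was missing; extend as needed

      missing = [p for p in PACKAGES
                 if subprocess.run(["rpm", "-q", p],
                                   stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL).returncode != 0]

      if missing:
          print("missing packages:", ", ".join(missing))
      else:
          print("all required packages installed")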

       

      CVMFS Varnish issues starting January 26

      • Removed our Varnish caches from the CVMFS proxy configuration on all workers (a verification sketch is below)
      • Root cause still to be understood
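
      A small verification sketch for that proxy change, assuming the standard cvmfs_config showconfig command; the repository name and the "varnish" substring used to spot the old proxies are placeholders, not the actual MWT2 names.

      #!/usr/bin/env python3
      # Sketch: confirm that the Varnish caches no longer appear in the effective
      # CVMFS proxy setting on a worker node.
      import subprocess

      REPO = "atlas.cern.ch"       # repository whose effective config we inspect
      VARNISH_MARKER = "varnish"   # placeholder substring matching the old proxy hostnames

      out = subprocess.run(["cvmfs_config", "showconfig", REPO],
                           capture_output=True, text=True).stdout

      for line in out.splitlines():
          if line.startswith("CVMFS_HTTP_PROXY"):
              print(line)
              if VARNISH_MARKER in line.lower():
                  print("WARNING: a Varnish host is still in the proxy list")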

       

      Setting up Kafka and the wlcgConverter for our dCache

      IU Brocade to be replaced in the coming weeks to a month

    • 11:30 - 11:40
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
    • 11:40 - 11:50
      SWT2 10m
      Speakers: Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • The site has not been full for a couple of days. We suspect: 1) lots of single-core production; 2) a file-transfer backlog (DC24, FTS, Google traffic?).
      • Still trying to get the UPS upgrade work performed...
      • Debugging hardware problems on a few compute nodes - this can be tedious.
      • Have set up both AlmaLinux 9 and Rocky Linux 9 instances. No strong preference here. The full cluster migration will come later (but in time).
      • Prior to the past couple of days, mostly smooth running.

       

      OU:

      • Mostly running well over the last few weeks.
      • SLATE Squid is ready for production, but there is a network issue between the CERN squid monitor and the OU SLATE node; working on that.
      • DC24 overloaded some xrootd storage servers; we have to periodically restart the xrootd daemons on those (a watchdog sketch is below).
      • The CE is already on EL9; we will upgrade OSG from 3.6 to 23. The SLATE Squid is also already on EL9. The new SE will be installed with EL9 and OSG 23 when we receive it next month. OSCER compute nodes will be upgraded to EL9 later this spring.
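
      A minimal watchdog sketch along the lines of those periodic restarts, assuming xrootd listens on the default port 1094 and runs under a systemd unit; the unit name xrootd@clustered is an assumption and should be replaced with the local instance name.

      #!/usr/bin/env python3
      # Sketch: restart xrootd when it stops answering on its service port.
      import socket
      import subprocess

      HOST, PORT = "localhost", 1094     # default xrootd port
      UNIT = "xrootd@clustered"          # assumed unit name; replace with the local one

      def responsive(host, port, timeout=10):
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return True
          except OSError:
              return False

      if not responsive(HOST, PORT):
          print("xrootd not answering; restarting", UNIT)
          subprocess.run(["systemctl", "restart", UNIT])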