US ATLAS Tier 2 Technical

US/Eastern
Alexei Klimentov (Brookhaven National Laboratory (US)), Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
      • Does anyone object to starting future meetings 1 hour earlier?
      • Good running until the token change and a problematic PanDA code change to setterupper.
        • Still recovering from those problems.
        • Need to work to ensure that the US sites are informed of central changes.
          • The US sites did not get tickets notifying us of the token changes while other WLCG sites did.
      • Keep pushing on EL9 and OSG 23.

       

    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric.han-wei Yen (Academia Sinica (TW)), Felix.hung-te Lee (Academia Sinica (TW))
    • 11:20 11:30
      AGLT2 10m
      Speakers: Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Today: site downtime at UM to replace the main breaker in the whole-room UPS.
      This is the third attempt; work is under way.

      EL9: 17 UM worker nodes now on RHEL9

      SOC: Milestone completed.
      The MSU capture node is operational, but the observed throughput is too low; investigating.

      New IAM instance: now configured in dCache.
      The test page was showing problems; Petr updated the test scripts and all tests are now green.
      A minimal sketch of such a token-based check follows.
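      For illustration only, here is a minimal sketch of the kind of token-based dCache check such test scripts perform. The WebDAV door URL, the test path, and the assumption that a WLCG token is already exported in the BEARER_TOKEN environment variable (e.g. by oidc-agent or htgettoken) are placeholders, not the actual AGLT2 or IAM configuration:

```python
# Illustrative only: a token-based smoke test against a dCache WebDAV door.
# The door URL, path, and token source are assumptions, not AGLT2's real setup.
import os

import requests

DOOR = "https://dcache-door.example.edu:2880"                     # hypothetical WebDAV door
PATH = "/pnfs/example.edu/data/atlasscratchdisk/token_probe.txt"  # hypothetical test path
TOKEN = os.environ["BEARER_TOKEN"]                                # e.g. from oidc-agent or htgettoken

headers = {"Authorization": f"Bearer {TOKEN}"}

# Write a small probe file, read it back, then delete it.
put = requests.put(f"{DOOR}{PATH}", data=b"token probe\n", headers=headers)
get = requests.get(f"{DOOR}{PATH}", headers=headers)
delete = requests.delete(f"{DOOR}{PATH}", headers=headers)

print("PUT", put.status_code, "GET", get.status_code, "DELETE", delete.status_code)
ok = put.ok and get.ok and get.content == b"token probe\n" and delete.ok
print("all green" if ok else "check FAILED")
```

      A failing status code on any of the three requests would point at either the door configuration or the token mapping.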

    • 11:30 11:40
      MWT2 10m
      Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US))

      Operations

      • SNMP monitoring updated for the new IU Arista switches (see the polling sketch after this list)
      • A couple of brief outages earlier this week due to hypervisors rebooting
      • Updated our dCache gPlazma and xrootd domains for the new token issuer
      • Partially drained since the 29th due to various upstream issues
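      For reference, a minimal sketch of polling standard IF-MIB interface counters over SNMP with the classic pysnmp hlapi; the switch hostname, community string, and interface index are placeholders rather than the actual MWT2 monitoring configuration:

```python
# Illustrative only: poll 64-bit interface counters from IF-MIB via SNMP v2c
# using the classic pysnmp hlapi. Hostname, community, and ifIndex are placeholders.
from pysnmp.hlapi import (
    CommunityData,
    ContextData,
    ObjectIdentity,
    ObjectType,
    SnmpEngine,
    UdpTransportTarget,
    getCmd,
)

SWITCH = "arista-sw1.example.edu"   # hypothetical switch hostname
COMMUNITY = "public"                # hypothetical read-only community string
IF_INDEX = 1                        # interface index to poll

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData(COMMUNITY, mpModel=1),                      # SNMP v2c
        UdpTransportTarget((SWITCH, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifHCInOctets", IF_INDEX)),
        ObjectType(ObjectIdentity("IF-MIB", "ifHCOutOctets", IF_INDEX)),
    )
)

if error_indication or error_status:
    print("SNMP query failed:", error_indication or error_status.prettyPrint())
else:
    for var_bind in var_binds:
        # Prints e.g. "IF-MIB::ifHCInOctets.1 = 123456789"
        print(" = ".join(x.prettyPrint() for x in var_bind))
```

      In practice such counters would be polled periodically and fed into the existing monitoring rather than printed.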

      EL9 Upgrade Status

      • UIUC datacenter move and RHEL9 upgrade pushed back until January 2025 due to delays in obtaining new switches
      • All UC and IU workers upgraded to AlmaLinux 9
      • ~75% of the UC storage has been upgraded to AlmaLinux 9
      • IU management hypervisors to be upgraded to AlmaLinux 9 next week, followed by upgrading IU to OSG23
      • UC management hypervisors are currently being upgraded to AlmaLinux 9
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Operations highlights

      • 04/20/2024: We were still seeing the CVMFS problems from the week before. We decided to redeploy the CVMFS DaemonSet in the cluster, bringing efficiency back to normal levels, although still somewhat turbulent.
      • 04/27/2024: We redeployed the CVMFS DaemonSet again, using the new CVMFS release image made available on the CERN registry (on 04/22/2024), improving the cluster efficiency further (see the DaemonSet restart sketch after this list).
      • 04/30/2024: We restarted our dCache headnode to fully apply the new token configuration. We are now passing tests with both the old and new token configurations, for both WebDAV and xrootd.
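      For illustration, a minimal sketch of a DaemonSet rolling restart using the official kubernetes Python client, mimicking what "kubectl rollout restart" does by bumping a pod-template annotation; the namespace and DaemonSet name are placeholders, not the actual NET2 object names:

```python
# Illustrative only: trigger a rolling restart of a CVMFS DaemonSet with the
# official kubernetes Python client, the same way "kubectl rollout restart"
# does it. Namespace and DaemonSet name are placeholders, not NET2's objects.
from datetime import datetime, timezone

from kubernetes import client, config

NAMESPACE = "cvmfs"               # hypothetical namespace
DAEMONSET = "cvmfs-nodeplugin"    # hypothetical DaemonSet name

config.load_kube_config()         # use load_incluster_config() when run inside the cluster
apps = client.AppsV1Api()

# Changing a pod-template annotation makes the controller roll every pod,
# so each node picks up the currently configured (new) CVMFS image.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt":
                        datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}

apps.patch_namespaced_daemon_set(name=DAEMONSET, namespace=NAMESPACE, body=patch)
print(f"Rolling restart triggered for {NAMESPACE}/{DAEMONSET}")
```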

       

      Development

      • Working on Prometheus token-less access for OSG.

      Deployment

      • 04/22/2024: 7 new storage servers deployed. NET2 total storage available: 11.4 PB (usable).
      • Our dCache cluster has been based on Alma9 since 08/27/2023.
      • The image used by the pilots has been Alma9-based for all Kubernetes sites since 02/19/2024 (NET2 had been using it intermittently, helping to prepare that image, since November 2023).
      • FY24 compute machines have been racked. Network configuration is ongoing.
    • 11:50 12:00
      SWT2 10m
      Speakers: Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      CPB: 

      • Trying to understand occasional HC off-lining events, which seem to be related to storage direct I/O.

      • Testing a fourth DTN - expect to bring it online soon.

      • CVMFS pilot wrapper failures ("faults") have decreased.

      • Added the SWT2_GOOGLE_ARM PanDA queue - not much running in the HIMEM queue recently.

      • Meeting with campus networking folks this week to finalize the WLCG monitoring deployment.

      • A student is working on the LOCALGROUPDISK "atime" project (a minimal scan sketch follows this section).

      • We have a few servers set up so far for testing. The machine that will become our frontend is currently running AlmaLinux 9 with Puppet and Foreman installed and running; we are working on understanding and testing these tools.
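      As a starting point, a minimal sketch of the kind of "atime" scan the LOCALGROUPDISK project could use: walk a storage mount and summarize files not accessed in the last N days. The mount point and threshold are placeholders, and the scan assumes the filesystem records access times (i.e. is not mounted with noatime):

```python
# Illustrative only: summarize LOCALGROUPDISK files not accessed in the last
# N days using their atime. Mount point and threshold are placeholders, and
# the scan assumes the filesystem is not mounted with noatime.
import os
import time

ROOT = "/xrd/localgroupdisk"   # hypothetical storage mount point
DAYS = 365                     # report files not read within the last year
cutoff = time.time() - DAYS * 86400

stale_files = 0
stale_bytes = 0
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.stat(path)
        except OSError:
            continue  # file disappeared or is unreadable; skip it
        if st.st_atime < cutoff:
            stale_files += 1
            stale_bytes += st.st_size

print(f"{stale_files} files totalling {stale_bytes / 1e12:.2f} TB "
      f"have not been accessed in the last {DAYS} days under {ROOT}")
```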

       

      OU:

      • Running well, nothing to report.
      • Still need to update CRIC so that OU's Slate_SQUID is used: https://atlas-cric.cern.ch/core/experimentsite/detail/OU_OSCER_ATLAS/

       

      Google:

      • Phase 1 was very successful: all stuck very-high-memory and express-queue tasks were completely mopped up within a few weeks. No tasks are left to do, but both queues will remain on in case new work shows up.
      • For Phase 2, Mario and others in the Rucio team are helping to set up an SE at Google. Right now the Google queues use the SWT2_CPB SE.
      • For Phase 3, Fernando set up an ARM queue, which quickly ramped up to 5k cores. It looks more cost-effective than Intel; this will be studied. However, HC suddenly blacklisted and shut down the queue. Kaushik will follow up with an email on lessons learned.