US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Videoconference
US ATLAS Tier 2 Technical
Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 10:00 10:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
      • Running has been ok
        • Pile-up jobs have caused issues
        • Sites should report on what other issues they have encountered since the last meeting
      • Please provide any info you want me to mention in the scrubbing slides.
      • Update your milestones and report on them during your site reports.
        • Report on when you will go to EL9 / OSG23.
      • Are there any purchases or retirements that will happen by June 30 that should be added to v68 of the capacity sheet?
    • 10:10 10:20
      TW-FTT 10m
      Speakers: Felix.hung-te Lee (Academia Sinica (TW)), Han-Wei Yen
    • 10:20 10:30
      AGLT2 10m
      Speakers: Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      EL9:
      - UM: finished the transition on all worker nodes; will finish the dCache pool nodes today.
      - MSU: IT is now on board for the Red Hat Capsule. Working on deploying it. Working on MSU-specific Ansible details in parallel.
      CVMFS:
      - new version coming: “I have a patch to avoid CVMFS blocking when the cache manager crashes in some very specific circumstances”
      - partition too small for the workload: at the UM site, the partition on all worker nodes was enlarged (roughly doubled), from 26G to 46G.
      Data Loss:
      - procedural mistake during conversion of one dCache pool node to EL9
      - we lost the pool file systems. JIRA ticket
      - all files with only one copy on that node were declared lost (0.5M files)
      - located copies cached on other pools and promoted them as primary
      - files deleted from the dCache namespace
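
The triage in the list above can be sketched in Python. This is only an illustration under stated assumptions, not the procedure or tooling AGLT2 actually used: the in-memory `catalog`, the function name `triage_lost_pool`, and the "precious"/"cached" labels are hypothetical stand-ins for whatever replica information the site queried.

```python
from typing import Dict, Set, Tuple


def triage_lost_pool(
    replicas: Dict[str, Dict[str, str]],
    lost_pool: str,
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Classify files after one pool is lost.

    `replicas` maps a file ID to {pool name: "precious" | "cached"}
    (hypothetical layout, assumed for this sketch).
    Returns (unaffected, recoverable, lost).
    """
    unaffected: Set[str] = set()
    recoverable: Set[str] = set()
    lost: Set[str] = set()
    for file_id, copies in replicas.items():
        if lost_pool not in copies:
            unaffected.add(file_id)
            continue
        # Copies that survive on pools other than the lost one.
        survivors = {p: s for p, s in copies.items() if p != lost_pool}
        if not survivors:
            # The only copy was on the lost pool: declare the file lost.
            lost.add(file_id)
        elif any(state == "precious" for state in survivors.values()):
            # A primary copy still exists elsewhere.
            unaffected.add(file_id)
        else:
            # Only cached copies remain: one must be promoted to primary.
            recoverable.add(file_id)
    return unaffected, recoverable, lost


# Hypothetical example data.
catalog = {
    "fileA": {"pool1": "precious"},
    "fileB": {"pool1": "precious", "pool2": "cached"},
    "fileC": {"pool2": "precious"},
    "fileD": {"pool1": "cached", "pool3": "precious"},
}
unaffected, recoverable, lost = triage_lost_pool(catalog, "pool1")
print("lost:", sorted(lost))
print("promote a cached copy:", sorted(recoverable))
```

Files in the `lost` set are the ones that would be removed from the namespace; files in `recoverable` keep their namespace entry once a surviving copy is promoted.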

    • 10:30 10:40
      MWT2 10m
      Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US))
      • EL9 upgrades:
        • 100% worker nodes
        • > 97% storage and head services.
      • IU had a downtime on Monday (06/24/2024) for network changes.
    • 10:40 10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
      • CVMFS problems: we are testing Fernando's tool for k8s. The first run did not work due to permissions on OKD, but it is a solvable problem.
      • FY24 machines deployment: 7 machines deployed, 1 machine being used for test (will be integrated when done), 1 machine in maintenance
        • several small adjustments were needed for the denser machines; final changes will be committed when troubleshooting is done, as most problems are intertwined.
      • Redundant core router: VRRP tested successfully at 400G. Starting the distribution to ToR switches with MC-LAG.
    • 10:50 11:00
      SWT2 10m
      Speakers: Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • alma9 upgrade in progress
        • should be able to begin rolling upgrades of the WNs soon
      • expect to resolve a CRAC issue by tomorrow
      • we think we're making some progress on the job stage-out issue
      • once we're post-alma9 we will begin planning for a HW purchase
        • we want to focus on storage and new XRootD proxy DTNs
      • two students now working with us for the summer

       

      OU:

      • OSCER maintenance today in preparation for EL9 upgrades.
      • EL9 tests are running well on OU_OSCER_ATLAS_TEST, so we don't expect any problems during the upgrade.
      • Not sure about the HC memory failures, though.
      • Got up to 100% additional opportunistic throughput (over 10,500 slots total) yesterday because of good backfill!