US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))

      1. After recovering from the shutdown on 26 Sep, the site has been running smoothly.

      2. Issues and tickets were resolved by reconfiguration after the downtime between 23 and 26 Sep. 

    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
      • smooth running over the last 2 weeks; no tickets; no incidents
      • found a few worker nodes with CVMFS issues that needed a manual fix
      • recent jobs (both ATLAS and OSG) have lower CPU efficiencies (most worker nodes have all job slots claimed but only ~50% user CPU); to address this:
        • identify nodes with stale BOINC jobs and abort the stale jobs so BOINC can use the idle CPU
        • change the policy to allow BOINC to use a varying number of CPU cores depending on the system's user CPU utilization (a minimal sketch of this kind of adjustment follows below)
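
      A minimal sketch of this kind of per-node adjustment (not AGLT2's actual tooling), assuming the standard BOINC client data directory and the boinccmd utility; the thresholds and the 50% cap are illustrative placeholders:

        #!/usr/bin/env python3
        """Illustrative sketch: throttle the BOINC client's core allowance based
        on how much user CPU the regular batch payloads are actually using.
        Paths and thresholds are placeholders, not AGLT2's production policy."""
        import subprocess
        import time

        OVERRIDE = "/var/lib/boinc/global_prefs_override.xml"  # assumed client data dir

        def user_cpu_fraction(interval=30):
            """Fraction of CPU time spent in user+nice mode over `interval` seconds."""
            def snap():
                with open("/proc/stat") as f:
                    fields = [int(x) for x in f.readline().split()[1:]]
                return fields[0] + fields[1], sum(fields[:8])   # user+nice, total
            u1, t1 = snap()
            time.sleep(interval)
            u2, t2 = snap()
            return (u2 - u1) / max(t2 - t1, 1)

        def set_boinc_cpu_pct(pct):
            """Write max_ncpus_pct to the override file and have the client re-read it."""
            with open(OVERRIDE, "w") as f:
                f.write("<global_preferences>\n"
                        f"  <max_ncpus_pct>{pct:.1f}</max_ncpus_pct>\n"
                        "</global_preferences>\n")
            # May need to run from the BOINC data directory or supply the RPC password.
            subprocess.run(["boinccmd", "--read_global_prefs_override"], check=True)

        if __name__ == "__main__":
            user = user_cpu_fraction()
            # Placeholder policy: give BOINC the CPU the payloads are not using,
            # minus a safety margin, but never more than half the node.
            set_boinc_cpu_pct(min(50.0, max(0.0, (1.0 - user) * 100 - 10)))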


    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US)), lincoln bryant

      Testing the new pilot version (3.10.6.27). Seems to be working as expected now. Will start increasing MeanRSS next week

      Setting up the Prometheus CVMFS exporter to monitor CVMFS client traffic
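
      A quick way to sanity-check what the exporter publishes before pointing Prometheus at it (the port below is a placeholder rather than a documented default, and the metric names depend on the exporter version):

        #!/usr/bin/env python3
        """Dump the CVMFS metric samples from the exporter's /metrics page."""
        import urllib.request

        EXPORTER_URL = "http://localhost:9868/metrics"   # placeholder host:port

        with urllib.request.urlopen(EXPORTER_URL, timeout=10) as resp:
            for line in resp.read().decode().splitlines():
                # Keep only CVMFS metric samples, skipping the HELP/TYPE comments.
                if line.startswith("cvmfs") and not line.startswith("#"):
                    print(line)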

      Partially drained for system updates last week

      IU UPS maintenance on 10/02/2025

    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Storage issues over the past week or so. dCache needs to be rebalanced, a problem aggravated by tape transfers, which can fill up pools when slow transfers don't free space fast enough. We also had to drain a few pools that were having kernel panics. All of this means occasional blacklistings due to failed transfers, generally resolved quite rapidly. We are working on it and hope it will be more stable soon (a small pool-occupancy monitoring sketch follows below).
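
      A rough sketch of how one might watch for pools filling up faster than space is released, assuming a JSON endpoint that reports per-pool total and used space (for example a dCache frontend REST service, if enabled); the URL, field names, and the 90% cutoff are all assumptions:

        #!/usr/bin/env python3
        """Flag pools that are nearly full and may need rebalancing or draining.
        The endpoint and JSON layout are assumptions; adapt to whatever space
        reporting the site actually exposes (certificate handling omitted)."""
        import json
        import urllib.request

        POOLS_URL = "https://dcache-frontend.example.edu:3880/api/v1/pools"  # assumed
        FULL_THRESHOLD = 0.90   # illustrative cutoff

        with urllib.request.urlopen(POOLS_URL, timeout=30) as resp:
            pools = json.load(resp)

        for pool in pools:
            space = pool.get("space", {})     # assumed layout: {"total": ..., "used": ...}
            total, used = space.get("total", 0), space.get("used", 0)
            if total and used / total >= FULL_THRESHOLD:
                print(f"{pool.get('name', '?')}: {used / total:.0%} full - rebalance/drain candidate")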

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • Continuing migration

        • Completed the migration of data from three MD3460 storage arrays to new R760xd2 storage. The first two MD3460s are now retired but still being used for testing purposes. 

        • Ran into issues with space reporting, contacted XRootD experts for assistance, and fixed the issue using the frm_admin utility. 

        • Met with XRootD experts to discuss improving the speed of our migration script. The information they provided was very helpful. We are testing different options and researching this further. 

          • We attempted to use the streams option for xrdcp, but it was confirmed that the version of XRootD (5.6.9) on EL7 that we are using has a bug where this option does not work. We are working toward enabling this option so we can run tests and measure any improvement in speed. So far this seems to affect the newer 5.8.4 release on EL9 as well, but we are still investigating. 

          • We also ran multiple instances of the migration script in parallel, where each instance has a different set of files to transfer and a unique migration ID (a simplified sketch of this pattern appears after this list). 

          • We added email notifications for when the script ends so that we find out sooner why it stopped. This makes it easier to track its status and saves time. 

        • We successfully rebuilt one of the retired MD3460 storage arrays from EL7 to EL9 while preserving data. We are still testing and checking this closely for safety. If this works consistently and reliably, we may be able to do the same with other storage to speed up the process. 
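
        A simplified sketch of the pattern described above (not the actual SWT2 migration script): split the file list into chunks, run one worker per chunk with its own migration ID, use xrdcp's --streams option where a non-buggy build allows it, and send a mail when a worker exits. Endpoints, paths, the notification address, and the stream count are placeholders:

          #!/usr/bin/env python3
          """Parallel migration sketch with per-worker IDs and completion emails."""
          import smtplib
          import subprocess
          import sys
          from concurrent.futures import ProcessPoolExecutor
          from email.message import EmailMessage

          SRC_PREFIX = "root://old-se.example.edu//atlas"   # placeholder endpoints
          DST_PREFIX = "root://new-se.example.edu//atlas"
          NOTIFY = "admins@example.edu"                     # placeholder address
          STREAMS = 4   # parallel TCP streams per transfer (needs a working build)

          def notify(subject, body):
              """Send a short completion/failure notice via the local MTA."""
              msg = EmailMessage()
              msg["Subject"], msg["From"], msg["To"] = subject, NOTIFY, NOTIFY
              msg.set_content(body)
              with smtplib.SMTP("localhost") as smtp:
                  smtp.send_message(msg)

          def migrate_chunk(migration_id, files):
              """Copy one slice of the file list; return the number of failed copies."""
              failures = 0
              for path in files:
                  cmd = ["xrdcp", "--nopbar", "--streams", str(STREAMS),
                         SRC_PREFIX + path, DST_PREFIX + path]
                  if subprocess.run(cmd).returncode != 0:
                      failures += 1
              notify(f"migration {migration_id} finished",
                     f"{len(files)} files processed, {failures} failures")
              return failures

          if __name__ == "__main__":
              # Input: one relative file path per line (e.g. from a namespace dump).
              with open(sys.argv[1]) as f:
                  all_files = [line.strip() for line in f if line.strip()]
              nworkers = 4
              chunks = [all_files[i::nworkers] for i in range(nworkers)]
              with ProcessPoolExecutor(max_workers=nworkers) as pool:
                  results = pool.map(migrate_chunk, range(nworkers), chunks)
              print("total failures:", sum(results))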

      • IPv6

        • Implemented IPv6 on the public interfaces of both of our Condor-CE nodes. We restarted the Condor-CE service on both nodes in order to pick up this change (a small connectivity-check sketch follows this list). 

        • Disabled the public interface on our Slurm server, as it is no longer needed. 
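
        A quick check that a CE answers over its IPv6 address after such a change; the hostname is a placeholder, and 9619 is the usual HTCondor-CE port:

          #!/usr/bin/env python3
          """Try a TCP connection to every IPv6 address the CE hostname resolves to."""
          import socket

          HOST, PORT = "gate01.example.edu", 9619   # placeholder CE hostname

          try:
              addrs = socket.getaddrinfo(HOST, PORT, socket.AF_INET6, socket.SOCK_STREAM)
          except socket.gaierror as err:
              raise SystemExit(f"no AAAA record for {HOST}: {err}")

          for family, socktype, proto, _, sockaddr in addrs:
              with socket.socket(family, socktype, proto) as s:
                  s.settimeout(10)
                  try:
                      s.connect(sockaddr)
                      print(f"IPv6 OK: connected to {sockaddr[0]} port {PORT}")
                  except OSError as err:
                      print(f"IPv6 FAIL for {sockaddr[0]}: {err}")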

      • EL9 

        • We are continuing to develop the EL9 Puppet modules for our XRootD proxy and XRootD redirector nodes as time permits. The modules have been created, but they still need to be improved and tested before being used in production. 

      • Job Failures and Partial Drain

        • Several hundred jobs failed due to SIGTERM (Example). The initial time limit for running jobs was set to forty-eight hours in CRIC but has been gradually increased to seventy-two hours. 

        • We experienced a partial drain on 9/22 due to issues with our local DNS server. We saw unexpected behavior when adding new records for EL7 nodes, and attempts to resolve the issue caused the loss of the EL9 DNS records, but these were rebuilt shortly afterward. 

      • Failed Monthly Inventory Dump

        • We encountered an issue with generating the monthly inventory dump. The problem was caused by old (and since removed) storage still appearing in the list of storage nodes. We plan to update the inventory dump script to resolve this, but we need confirmation from Fabio concerning certain details. 

      • Misc.

        • We have been working with a vendor, under warranty, to replace failed drives and resolve other hardware issues with worker nodes; there have been somewhat more of these than usual over the past two weeks. 

        • We significantly improved internal documentation on power draw, warranty, and worker node locations for better future planning and improvements. 

        • We are gathering more information and preparing for UPS preventative maintenance scheduled for 10:30 a.m. on 10/6, so that we can assess any risk to operations and take the necessary steps to prepare for it.

        • Resolved an issue with our IPMI Puppet module that was causing problems for certain worker nodes when updating via the Puppet agent. They are now updating properly.  

        • Improved backup of head nodes to include additional nodes. 

        • Created additional alerts for better visibility on significant events we should be aware of. 

      OU:

      • Nothing to report, site running well, including good opportunistic throughput
      • Haven't seen any more cvmfs processes stuck in the D (uninterruptible sleep) state over the last week, but we still don't understand how they happen when they do occur (a small detection sketch follows this list)
      • Token support for se1.oscer.ou.edu is working for https/davs. It will also be enabled for root/roots once we switch over to the new storage. As far as we are aware, nothing is currently using root/roots token support, so we think we can close this ticket again.
      • Dual stack is working for storage and internally, and we're working on enabling it for grid1 (CE) and slate01 (Squid).
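
      A small sketch of one way to watch for the stuck processes mentioned above: scan /proc for processes in uninterruptible sleep (state D) whose command name mentions cvmfs. (The naive stat parsing assumes the command name contains no spaces, which holds for cvmfs2.)

        #!/usr/bin/env python3
        """List processes in D state whose command name mentions cvmfs."""
        import os

        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/stat") as f:
                    fields = f.read().split()
                comm, state = fields[1].strip("()"), fields[2]
            except OSError:
                continue          # process exited while we were scanning
            if state == "D" and "cvmfs" in comm:
                print(f"PID {pid}: {comm} is in D state")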