US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 10:00 10:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 10:10 10:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
    • 10:20 10:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
      • MSU RH satellite for EL9: on track for first step of version upgrade 
            Yesterday V6.13 -> 6.14 on Satellite (by MSU IT)
            Today V6.13 -> 6.14 on Capsule (by us)
            If all goes well -> V6.15 on Thursday/Friday
            V6.15 will match current version at UM
            Current version 6.17 (released in May)
            Will pause updating at 6.15
      • Investigating changes to actual memory limits for cgroup killing jobs. 
            Currently killing at the job’s requested memory (seems unproductive)
            Have not found an effective way to change the RequestMemory on the HTCondor-CE, so we opened a user support ticket with OSG.
      • Downtime (6/3/2025) at the UM site for UPS work. The work finished early, and all services and worker nodes were brought back the same day.
      • Completed all the work of Milestone 491
            "integrating all critical service monitoring and billing data into the AGLT2 opensearch platform and created dashboards"
            Now marked as finished
      • S3 data transfer ticket
          Service was being killed on OOM.
          Increased VM memory 16 -> 24 -> 48 G
          Also discovered an issue with UM DNS on EL9 failing DNSSEC lookups of BNL sites (also NERSC, KIT, ...).
          Partial solution was to add MSU DNS (with no DNSSEC) as a backup server (as it should have been anyway).
          BNL resolved the DNSSEC issue.
          But we are turning off DNSSEC anyway (for now).
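
To illustrate the cgroup memory item above: a minimal sketch of a kill threshold that leaves headroom over the job's requested memory instead of killing exactly at the request. The function name, the 1.5 headroom factor, and the optional per-node cap are hypothetical and chosen for illustration only; they are not AGLT2 settings and not HTCondor configuration.

```python
from typing import Optional


def kill_threshold_mb(request_mb: int,
                      headroom: float = 1.5,
                      node_cap_mb: Optional[int] = None) -> int:
    """Return the memory level (MB) at which a job would be killed.

    request_mb  -- memory the job asked for
    headroom    -- hypothetical multiplier (>= 1.0) applied to the request
    node_cap_mb -- optional upper bound, e.g. what the node can spare
    """
    if request_mb <= 0:
        raise ValueError("request_mb must be positive")
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")

    limit = int(request_mb * headroom)
    if node_cap_mb is not None:
        # Never allow more than the node-level cap, whatever the request.
        limit = min(limit, node_cap_mb)
    return limit
```

With `headroom=1.0` this reproduces the current behavior of killing at the requested memory; any larger value lets a job exceed its request by that factor before it is killed.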

    • 10:30 10:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • Network outage June 5-6
      • BGP ticket ready to resolve (GGUS:2099)
      • Will plan our dCache upgrade to 10.2 over the next few weeks
    • 10:40 10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Operations:

      • OKD update studies ongoing
      • Finished PDU monitoring infrastructure. Two PDUs are defective and may need replacement.
      • Review of server benchmarking ongoing
      • Finalize maintenance of R6625 (AMD 9754) servers. 
      • Started planning virtualization of border router.

      Events:

      [6/2/2025 - 6/5/2025] Downtime due to data center shutdown for electrical maintenance. The tape system required IBM maintenance after the shutdown.

      [6/11/2025] Small reduction in number of jobs for server benchmarking.

    • 10:50 11:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • For the new storage and the EL9 migration, we continue to work with DDM ops to resolve the remaining configuration issues for the test cluster, so that we can run additional tests of both.

        • Set up storage dumps and provided requested info to DDM ops for test cluster setup in CRIC. 

        • We have been performing manual tests using commands and are now testing/improving migration scripts. 

        • We are continuing to adjust the storage module for improvements. Adjusting and testing partition configurations. 

        • Additional modules have been created and are waiting to be tested with the test cluster. In the meantime, we are reviewing them and making improvements where we can.

      • Improved our internal monitoring to include both CEs. 

      • Created and are developing new internal alerting using Zabbix. Currently developing in the test cluster. 

      • Thanks to Ivan’s help, we switched from 16-core back to 8-core jobs. The majority of our running slots have recently been filled with 8-core jobs.

      • Set CE to remove completed jobs after ten days to prevent stuck jobs from causing issues as suggested by experts. 

      • Created alerts for jobs that are completed and have been on the CE for more than six days. 

      • Adjusting and testing parameters for both of our CEs, in consultation with the Harvester team, to improve load balancing across the CEs.

      • Added additional nodes to the test cluster in order to test Varnish and other Puppet modules.

      • Gathering information and communicating with Dell to purchase new hardware to replace head nodes.  

      • GGUS tickets - Concerning BGP tagging and network monitoring, we tried reaching out to campus networking for the results of their discussions but have not received a response. We plan to contact other members to schedule a meeting.

      • Request to deploy IPv6 on CEs and WNs at WLCG sites - We have configured an IPv6 address on gk01 and the test CE. The HTCondor-CE service has to be restarted for it to take effect. Because of this, we will wait until gk10 is completely drained during our load balancing tests, configure IPv6, then restart the service. We do not see a reason to add IPv6 to WNs at this time, because they are isolated on an internal network.
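
To illustrate the completed-job alerting item above: a minimal sketch that flags jobs which have been in the completed state for more than six days, ahead of the ten-day removal. The `CompletedJob` record and the `stale_completed_jobs` function are hypothetical names for illustration; this is not the SWT2 Zabbix alert or any HTCondor-CE interface, and how the job list is obtained is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, NamedTuple


class CompletedJob(NamedTuple):
    """Hypothetical record for a job that has finished on the CE."""
    job_id: str
    completed_at: datetime


def stale_completed_jobs(jobs: Iterable[CompletedJob],
                         now: datetime,
                         alert_after: timedelta = timedelta(days=6)) -> List[str]:
    """Return IDs of jobs completed strictly longer ago than alert_after."""
    return [job.job_id for job in jobs if now - job.completed_at > alert_after]


if __name__ == "__main__":
    now = datetime(2025, 6, 11, tzinfo=timezone.utc)
    jobs = [
        CompletedJob("a", datetime(2025, 6, 1, tzinfo=timezone.utc)),  # 10 days old
        CompletedJob("b", datetime(2025, 6, 8, tzinfo=timezone.utc)),  # 3 days old
    ]
    print(stale_completed_jobs(jobs, now))
```

The six-day threshold sits below the ten-day removal so that an alert fires while the job is still on the CE and can be inspected.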