US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 10:00 10:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))

      Points discussed during the meeting:

       

      • TW-FTT: CRLs were updated, and Fred reported no errors since then; 150 TB transferred in the last 7 days without errors.
      • AGLT2: Bug in RH Satellite at MSU, with clear evidence of the bug in node installations. The plan is to move to v6.15, as used at UM, but that will only happen starting June 10th.
      • AGLT2: Trying to understand whether dCache 10 can be patched to enable production-version fireflies; otherwise the feature is only available in v11.
      • OU_OSCER: Recurrent storage-overload problems will only be fixed when the new CEPH storage is online. A recent new DTN improved the situation a bit, but transfers are limited to 4 GB/s.
      • CPB: The site is only getting 16-core jobs; Ivan says those are favored by CMS. The queue should be filled with 8-core jobs if 16-core jobs are not available.
      • CPB: Still configuring the test cluster in CRIC. A backup CE is configured, but it is not used in load-balancing mode so as not to mask problems. The main problem was not the CE but a CRIC configuration.
      • NET2: Studying which services are not ready for IPv6-only operation. Eduardo says a test cluster should be set up just for that; otherwise no one will do it (see the readiness-check sketch after this list).
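
      As a rough illustration of the kind of IPv6-only readiness check such a test cluster could run, the sketch below probes whether a service publishes an AAAA record and accepts a TCP connection over IPv6. The host names and ports are placeholders, not NET2's actual service list.

import socket

# Placeholder service list; a real list would come from the site's service inventory.
SERVICES = [("xrootd.example.org", 1094), ("webdav.example.org", 443)]

def reachable_over_ipv6(host, port, timeout=5):
    """Return True if the host resolves to an AAAA record and accepts a TCP
    connection over IPv6."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue
    return False

for host, port in SERVICES:
    status = "OK" if reachable_over_ipv6(host, port) else "NOT IPv6-ready"
    print(f"{host}:{port} {status}")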

       

    • 10:10 10:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
      1. Scheduled maintenance of the international network between TW-FTT (ASGC) and ESnet: 6am 28 May to 12pm 29 May (UTC+0).
      2. Able to support 330+ TB of data transmission (inbound + outbound) in May (see the quick rate estimate below).
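
      For scale, a back-of-the-envelope conversion of that monthly volume to a sustained rate, assuming 330 decimal TB spread evenly over the 31 days of May:

# Rough average-rate estimate for 330 TB moved over the full month of May.
volume_bytes = 330e12          # 330 TB, decimal
seconds = 31 * 24 * 3600       # seconds in May
rate_mb_s = volume_bytes / seconds / 1e6
print(f"~{rate_mb_s:.0f} MB/s, i.e. about {rate_mb_s * 8 / 1000:.1f} Gb/s sustained")
# -> roughly 123 MB/s, about 1 Gb/s averaged over the month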
    • 10:20 10:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      EL9 at MSU
        Satellite problems: a freshly defined node usually builds the first time but fails afterwards
          (it should copy user keys to the root account, create bonded interfaces, etc., but doesn't)
        Last week we asked MSU IT to update RH Satellite from v6.13 to v6.15
          (the current release is 6.17, but we will pause at 6.15, the version used at UM)
        However, MSU IT is currently in the middle of "production patching", this week and next.
        Submitted a change request: upgrade to 6.14 on Tue 6/10, then check, then to 6.15 on 6/12.
        Will find ways to limp along and make progress in the meantime.

      OpenSearch: improving condor/boinc/squid monitoring
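
      The dashboards themselves are not described here; as a hedged illustration of the kind of aggregation behind such monitoring, the sketch below counts Condor job records per status over the last hour using the opensearch-py client. The endpoint, credentials, index name, and field names are placeholders, not AGLT2's actual setup.

from opensearchpy import OpenSearch  # pip install opensearch-py

# Endpoint, credentials, index, and field names below are illustrative placeholders.
client = OpenSearch(
    hosts=[{"host": "opensearch.example.org", "port": 9200}],
    http_auth=("monitor", "changeme"),
    use_ssl=True,
)

# Count Condor job records per status reported in the last hour.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {"by_status": {"terms": {"field": "job_status.keyword"}}},
}
result = client.search(index="condor-jobs-*", body=query)
for bucket in result["aggregations"]["by_status"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])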

      PDU problem: still looking for a solution

      All EL9 worker nodes, the Condor head node, and the Condor CEs are now on OSG24

      Deleted dark data from aglt2datadisk (FTS test data and old monitoring info)

    • 10:30 10:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • UIUC BGP tagging completed (GGUS ticket #168404)
      • Limit on HIMEM jobs removed from the MWT2 queue
    • 10:40 10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Efforts to upgrade OKD are ongoing. Not all targeted issues have been resolved yet, and investigations are continuing.

      Tape operations are ongoing. Efforts are being made to improve throughput using the existing hardware, but it is difficult to predict total performance gains at this time.
      Additionally, there is a focused effort to enhance the Quality of Service (QoS), as the level acceptable for a Tier-2 site may not be sufficient for current needs.

    • 10:50 11:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • Actively communicating with OSG experts and the PanDA team about a bug in Condor: jobs are getting stuck in the Condor queue after completion. Recent discussions suggest this may be an issue with Harvester repeatedly losing contact with certain jobs (we are testing different condor-ce routing settings on the test cluster); a minimal query sketch for spotting such stuck jobs appears after this list. Also, the primary issue with our site not filling up properly with jobs was due to a CRIC setting, which Fred is helping us with. 

      • Implemented a second CE as a backup and to allow downtime maintenance of our main CE whenever needed. This second CE is operating with a 100-job maximum. 

      • Performed tests on our EL9 storage in the test cluster, using an environment similar to the production cluster. 

      • Coordinating with DDM Ops to add a second RSE and other components so we can use our test cluster separately to simulate production when staging changes. They added the RSE on Monday (5/26/2025), so we are close to having this completed. 

      • Our internal monitoring for Slurm and both CEs is now in place (thanks to Judith for the links and help). 

      • For EL9, we developed other appliances but are waiting to test them in the test cluster before deploying. 

      • Concerning the GGUS ticket to enable network monitoring, I sent a follow-up message to several members of campus networking last week. They said they would discuss it and move forward on our request. Waiting for a follow-up. 
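
      As referenced above, a minimal sketch of how one might spot jobs stuck in the Completed state is shown below, using the HTCondor Python bindings. The one-hour threshold and the attribute projection are illustrative choices, not the settings SWT2_CPB actually uses.

import time
import htcondor  # HTCondor Python bindings, available on the schedd/CE host

# List jobs that finished more than an hour ago but are still in the queue
# (JobStatus == 4 means Completed). Threshold and projection are illustrative.
schedd = htcondor.Schedd()
cutoff = int(time.time()) - 3600
ads = schedd.query(
    constraint=f"JobStatus == 4 && CompletionDate > 0 && CompletionDate < {cutoff}",
    projection=["ClusterId", "ProcId", "Owner", "CompletionDate"],
)
for ad in ads:
    print(f"{ad['ClusterId']}.{ad['ProcId']} owner={ad['Owner']} "
          f"completed={time.ctime(ad['CompletionDate'])}")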

       

      OU:

      • Occasional storage overload, caused by either heavy I/O jobs or massive WAN transfers. Should subside again eventually.
      • Other than that, stable operations.