US ATLAS Tier 2 Technical

Name: US ATLAS Tier 2 Technical
Start: 2025-05-14T10:00:00-04:00
End: 2025-05-14T11:00:00-04:00
Location: No location set

Wednesday 14 May 2025, 10:00 → 11:00 US/Eastern

Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))

Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Fred Luehring

luehring@iu.edu

+1 812 855 1025


- 10:00 → 10:10
  
  Introduction 10m
  
  Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
  
  Rafael_USTier2Meeting_05142025.pdf
- 10:10 → 10:20
  TW-FTT 10m
  
  Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
  The network bandwidth connecting to LHCONE through ESNet was upgraded from 3 Gbps to 5 Gbps on 1 May 2025. More than 3 Gbps of traffic was observed in the first week of May.
  
  Job efficiency in the most recent week was better than in the previous one. In general, the site is running smoothly.
- 10:20 → 10:30
  
  AGLT2 10m
  
  Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
  
  - 13-May: updated dCache 9.2.27 -> 10.2.12
    - Declared an at-risk (intermittent) outage
    - Smooth update
    - It seems that at most 7 jobs failed on direct access (pilot error 1361)
    - Part of the motivation was to enable firefly monitoring: this succeeded, though the content is not as expected (perhaps the expected latest changes didn't make it into this release?)
  
  - Switching WNs to OSG24 (at UM): noticed some were not running BOINC jobs; fixed the scripts.
  
  - EL9 at MSU: continuing to investigate node record corruption.
  Will likely ask for a satellite software update (in the next couple of days).
  
  - 10-May (Saturday) generator test at MSU data center = qualified success
    - Smoother than expected (i.e. no temperature fluctuation) when dropping the A-side (on UPS) or B-side power feed: the control system maintained (UPS) or restored full power in the racks within a fraction of a second.
    - When dropping both feeds, power came first from the UPS, then from the 2.5 MW generator within a minute.
    - Very minimal temperature fluctuation for all steps.
    - –> All the normally unattended steps were a full success.
    - Last step: going off the generator was somehow difficult to engage; then all AC units failed to restart automatically and took time and effort to coax back on.
    - All C6420s and most R6525s shut off on temperature alarm.
- 10:30 → 10:40
  MWT2 10m
  
  Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
  Prepared our draft procurement plan
  
  Working with UIUC for BGP tagging (GGUS ticket #168404)
  
  Large number of small file transfers on 5/13/2025:
- 10:40 → 10:50
  
  NET2 10m
  
  Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
  
  Some discussion about the current downtime
  
  We were upgrading our routing so that our router infrastructure can operate with SENSE. Part of this work is to segregate the different routing services (LHCONE, non-LHCONE, FABRIC, SENSE, ...) into different virtual routers and, in the process, ensure that only LHCONE communication goes through the L3VPN. We thought these changes would be transparent, but that turned out not to be the case. Non-LHCONE routing to NESE has stopped, and this is preventing transfers from going through (which raises the question of why we have to deal with requests from outside LHCONE). Jessa is working to fix it, but it will take another O(day) [estimate].
  
  The improvements being pursued will also help us reduce the asymmetric ASN paths that were flagged last week. On the other hand, this also means that several NET2_MCTAPE transfer requests will time out, but that is expected.
  
  We are working on preparing for the OKD upgrade. The first attempt was planned for this week, but we must postpone it now.
  
  We are making progress in understanding the tape system through our experience with Fabio's tests.
- 10:50 → 11:00
  SWT2 10m
  
  Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
  SWT2_CPB:
  
  Operations
  
  We recently migrated certificate management for our XRootD proxy servers from the EL7 admin node to the new EL9 admin node.
  
  After the certificate management change, we ran into an issue with updating CRLs on 5/5, which caused us to drain. We adjusted the cron job that runs on the EL9 admin node, which fixed the issue.
  
  We drained on 5/11 because many short jobs were being sent to sites; all sites drained because there were not enough jobs to keep them full.
  
  On 5/13 we noticed a lot of bad jobs on our CE, which eventually cleared up. During this period we completed over 25K jobs within twelve hours despite being only half full. We may have been receiving a lot of short jobs, which kept us from staying full. We are still investigating. There is a high number of exiting pilots, and the issue may be with our CE. The status change seems to be taking some time.
  
  We replaced the certificate on an XRootD proxy that had expired at the beginning of this month. We monitored the proxy afterwards and saw no issues.
  
  Slurm and CE Monitoring
  
  We provisioned a new EL9 monitoring server, set up Grafana, and set up a monitoring dashboard of our Slurm server. We received help from MWT2 concerning monitoring tools for our CE. We will be working on this very soon.
  
  We are researching and planning on implementing Zabbix for our new alert system. We have a student assisting us in researching this tool. We plan on configuring and testing this very soon.
  
  EL9 Migration
  
  Our new test cluster CE has been added to CRIC for the SWT2_CPB_TEST PanDA queue and is receiving test jobs. The jobs are failing, likely because our test cluster XRootD proxy has not been added to CRIC. It appears to be trying to use the production cluster’s XRootD proxy addresses (which is ).
  
  We are actively communicating with DDM Ops for assistance with adding this into CRIC. We are also asking questions to be careful in how we approach this.
  
  We created the monitor server module and are creating the XRootD redirector module.
  
  We have a list of minor improvements to make to the EL9 test cluster. We are going to make these changes soon.
  
  Once new storage has been implemented (or is in the early stages of being implemented), we will convert our XRootD proxies in the production cluster from EL7 to EL9.
  
  New Storage
  
  We have tested transferring and pulling files from our new storage running EL9 locally in the test cluster. We ran other commands to test, and it appears to be working.
  
  We also tested the new EL9 storage in the test cluster externally using an lxplus server. We ran different commands to test transferring files and other actions using different protocols. It is working.
  
  We plan to test this more thoroughly for one week (hopefully with test jobs putting disposable data on the new storage), then rebuild and retest, and start building the new storage as EL9 in the production cluster next week.
  
  GGUS Ticket - GoeGrid Transfer Failures
  
  Issue has been resolved through a change on our side.
  
  GoeGrid is not using LHCONE. Because of this, we are not able to route traffic through ESNet to reach it.
  
  We communicated with UTA campus networking and ESNet experts to coordinate a change in how the network routes this traffic. Campus networking is now routing this traffic over a commercial provider instead of ESNet. This resolved the issue.
  
  GGUS Ticket - Enable Network Monitoring
  
  We have not heard back from campus networking on the last follow-up. We will follow up.
  
  OU:
  
  Scheduled maintenance today (file system and network upgrades).
  
  This should fix the job failures we have been seeing over the last few days.
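
The CRL update problem reported under SWT2_CPB Operations above is the kind of failure that a simple freshness check can catch before a site starts draining. The following is a minimal sketch, not the check SWT2 actually runs. It assumes that CRLs are stored as `<hash>.r0` files (as fetch-crl typically writes them) in `/etc/grid-security/certificates`, and that a file not rewritten within 24 hours counts as stale. The function name `find_stale_crls` and the threshold are hypothetical choices for illustration.

```python
import time
from pathlib import Path


def find_stale_crls(cert_dir, max_age_hours=24.0, now=None):
    """Return names of CRL files in cert_dir last modified more than max_age_hours ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    # Assumption: CRLs are the "*.r0" files in the CA certificate directory.
    for crl in sorted(Path(cert_dir).glob("*.r0")):
        if crl.stat().st_mtime < cutoff:
            stale.append(crl.name)
    return stale


if __name__ == "__main__":
    # Assumed default location; adjust for the local setup.
    stale = find_stale_crls("/etc/grid-security/certificates")
    print(f"{len(stale)} stale CRL file(s)")
    for name in stale:
        print("  ", name)
```

Run from cron shortly after the CRL update job, a non-empty result could be turned into an alert. Modification time is only a proxy for a successful update, so a stricter check would also parse each CRL's next-update field.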
