US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
-
11:00 → 11:10
Introduction 10m
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
-
11:10 → 11:20
TW-FTT 10m
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
-
11:20 → 11:30
AGLT2 10m
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
1) Finished the UPS battery replacement in the Tier2 room at the UM site.
2) Updated cvmfs to the most recent version at the UM site because those nodes were seeing more problems than the MSU nodes.
3) MSU fixed the network traffic reporting problem: the data center router replacement had changed the SNMP indices. Still trying to understand why data challenge traffic was only traversing one of the two data center routers.
4) Sporadic nodes have cvmfs issues that need to be fixed manually; most of them occur on the nodes that use squids as the http_proxy (see the sketch below).
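As an illustration of the manual fix mentioned in item 4, here is a minimal sketch using the standard cvmfs client commands (cvmfs_config probe / wipecache / reload); the repository list is a placeholder and this is not necessarily the exact procedure AGLT2 uses:

#!/usr/bin/env python3
"""Probe cvmfs on a worker node and wipe the cache if the probe fails.

Hypothetical sketch: the repositories are placeholders, and the remediation
step (wipecache + reload) is one common manual fix, not necessarily the one
AGLT2 applies.
"""
import subprocess

REPOSITORIES = ["atlas.cern.ch", "atlas-condb.cern.ch"]  # placeholder repos

def probe(repo: str) -> bool:
    """Return True if the repository mounts and responds."""
    result = subprocess.run(["cvmfs_config", "probe", repo],
                            capture_output=True, text=True)
    return result.returncode == 0

def remediate(repo: str) -> None:
    """Wipe the local cache and reload the client for a broken repository."""
    subprocess.run(["cvmfs_config", "wipecache"], check=True)
    subprocess.run(["cvmfs_config", "reload", repo], check=True)

if __name__ == "__main__":
    for repo in REPOSITORIES:
        if not probe(repo):
            print(f"{repo}: probe failed, attempting cache wipe + reload")
            remediate(repo)
        else:
            print(f"{repo}: OK")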
-
11:30 → 11:40
MWT2 10m
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
11:40 → 11:50
NET2 10m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
The cluster is almost fully back online following the upgrade. We planned to update the firmware of all servers during the downtime, which may have been a mistake as there were difficulties updating the firmware of the older servers. We also had a problem with MAC address flapping on an old switch that was interfering with PXE booting; if we cannot find the origin of the problem we may have to replace that switch. So the process went more slowly than we were hoping, but it is now almost complete. The remaining servers (those without enterprise iDRAC licenses) will have their firmware updated today.
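As a side note, firmware versions can in principle be checked remotely through the standard Redfish firmware-inventory resource, assuming the BMCs expose Redfish; the sketch below uses placeholder hostnames and credentials and is not necessarily how NET2 drives its updates:

#!/usr/bin/env python3
"""List firmware versions via the DMTF Redfish FirmwareInventory resource.

Hypothetical sketch: the BMC address and credentials are placeholders.
"""
import requests
from requests.auth import HTTPBasicAuth

BMC = "https://idrac-example.local"       # placeholder BMC address
AUTH = HTTPBasicAuth("root", "changeme")  # placeholder credentials

def firmware_inventory(base: str) -> dict:
    """Return {component name: version} from the Redfish firmware inventory."""
    session = requests.Session()
    session.auth = AUTH
    session.verify = False  # many BMCs use self-signed certificates
    inventory = session.get(f"{base}/redfish/v1/UpdateService/FirmwareInventory").json()
    versions = {}
    for member in inventory.get("Members", []):
        item = session.get(f"{base}{member['@odata.id']}").json()
        versions[item.get("Name", member["@odata.id"])] = item.get("Version")
    return versions

if __name__ == "__main__":
    for name, version in firmware_inventory(BMC).items():
        print(f"{name}: {version}")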
The cluster was briefly blacklisted last night after cvmfs failed to start correctly when a node was restarted.
Ticket 1000422 can be closed; the old squid just had to be removed from the CRIC configuration (thanks, Ivan).
We had a good run on the data challenge, but we are waiting to hear from Hiro.
-
11:50 → 12:00
SWT2 10m
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
New Storage Deployment
- Improvements to the migration scripts have been completed, so we have resumed migration in the production cluster.
- We have been monitoring closely and holding discussions about the resumed migration. So far no data has been lost, and the first migration of one old storage unit has been completed. We are now discussing how to speed up the process for the next MD3460s to be migrated.
- The main migration script now appears more robust, safer, and more informative. We will continue to watch it closely and make any necessary improvements as we use it, and we have ideas for major improvements the next time we perform data migrations (a sketch of the kind of safety check involved follows this list).
- We found two broken files that are zero bytes in size on the source storage. One file has been restored by DDM (after being declared bad); the second file is on scratchdisk, is very old, and will be removed.
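A minimal sketch of the kind of pre-removal safety check such a migration script can apply (placeholder paths, adler32 checksums assumed; this is not the actual SWT2 script):

#!/usr/bin/env python3
"""Verify migrated files before the source copy is considered removable.

Hypothetical sketch: SOURCE/DEST are placeholder mount points, and the
checks (non-zero size, matching size, matching adler32) illustrate the
kind of safeguards a migration script can apply.
"""
import zlib
from pathlib import Path

SOURCE = Path("/storage/old_md3460")   # placeholder source mount
DEST = Path("/storage/new_pool")       # placeholder destination mount

def adler32(path: Path, chunk_size: int = 4 * 1024 * 1024) -> int:
    """Compute the adler32 checksum of a file in fixed-size chunks."""
    checksum = 1
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return checksum

def verify(rel_path: Path) -> bool:
    """Return True only if the destination copy is safe to keep."""
    src, dst = SOURCE / rel_path, DEST / rel_path
    if not dst.exists():
        return False
    if src.stat().st_size == 0:
        print(f"BROKEN SOURCE (zero bytes): {src}")  # report, do not treat as migrated
        return False
    if src.stat().st_size != dst.stat().st_size:
        return False
    return adler32(src) == adler32(dst)

if __name__ == "__main__":
    for src_file in SOURCE.rglob("*"):
        if src_file.is_file():
            rel = src_file.relative_to(SOURCE)
            status = "ok" if verify(rel) else "NOT VERIFIED"
            print(f"{rel}: {status}")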
GGUS-Ticket-ID #683657: Varnish
- We continued to coordinate with Ilija on the remaining Frontier accesses to our squid, which need to be stopped.
- We checked the squid monitoring and found that the remaining 5% of accesses are due to CVMFS (see the log-breakdown sketch after this list).
- As requested by Ilija, and to address the malformed-XML bug discovered previously, we updated the Frontier Varnish version. This adjusted how quickly Varnish removes old objects, resolving the malformed-XML issue.
- Since we found no additional nodes experiencing routing problems to our site, the ticket has now been marked resolved.
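For reference, a hypothetical way to break down squid accesses by client type, assuming the default native access.log format (URL in the 7th whitespace-separated field) and that CVMFS requests can be recognized by '/cvmfs/' in the URL; patterns would need to be adjusted to the actual setup:

#!/usr/bin/env python3
"""Classify squid access-log entries as Frontier, CVMFS, or other."""
from collections import Counter

LOG_FILE = "/var/log/squid/access.log"   # placeholder path

def classify(url: str) -> str:
    """Rough classification by URL; the patterns are assumptions."""
    if "/cvmfs/" in url:
        return "cvmfs"
    if "frontier" in url.lower():
        return "frontier"
    return "other"

if __name__ == "__main__":
    counts = Counter()
    with open(LOG_FILE) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) > 6:
                counts[classify(fields[6])] += 1
    total = sum(counts.values()) or 1
    for category, n in counts.most_common():
        print(f"{category}: {n} ({100 * n / total:.1f}%)")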
Brief Reduction in Capacity
- The chilled water plant experienced power issues on 8/30 at 12:55 p.m. due to poor weather conditions (thunderstorms), which caused the data center to become very warm. We temporarily drained 20% of our WN, consisting of the oldest models, to help control the temperature, and monitored closely. We contacted the chilled water plant for updates and opened a ticket concerning our issue; it was resolved on 8/31. We put the drained WN back into service on the evening of 8/31.
- An unscheduled warning downtime entry was created in CRIC since we were at risk of experiencing an outage.
- We experienced some unexpected issues with the alerting for high temperature and are investigating them; unfortunately, this delayed our awareness of the problem (see the sketch below).
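A minimal sketch of an independent temperature check, assuming local ipmitool access and inlet temperature sensors; the threshold is a placeholder and this is not the SWT2 alerting setup:

#!/usr/bin/env python3
"""Poll inlet temperatures with ipmitool and warn above a threshold."""
import re
import subprocess

THRESHOLD_C = 30.0   # placeholder warning threshold

def inlet_temperatures() -> dict:
    """Return {sensor name: degrees C} for inlet temperature sensors."""
    out = subprocess.run(["ipmitool", "sdr", "type", "temperature"],
                         capture_output=True, text=True, check=True).stdout
    temps = {}
    for line in out.splitlines():
        # Typical line: "Inlet Temp | 04h | ok | 7.1 | 24 degrees C"
        if "Inlet" in line:
            match = re.search(r"([\d.]+)\s+degrees C", line)
            if match:
                temps[line.split("|")[0].strip()] = float(match.group(1))
    return temps

if __name__ == "__main__":
    for sensor, temp in inlet_temperatures().items():
        state = "WARNING" if temp >= THRESHOLD_C else "ok"
        print(f"{sensor}: {temp:.1f} C [{state}]")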
Failed Jobs
- Over 600 jobs failed over the past week after hitting the maxtime limit set in CRIC. The limit was adjusted to 49 hours, but we continued to see jobs hitting it and receiving a SIGTERM. This appears to be a central issue, as it affects multiple sites.
OU:
- Site running well, no issues
- Draining some nodes to re-image them and force an upgrade to the latest cvmfs version
- XRootD token support works for HTTP but not for the xrootd protocol yet; investigating (see the sketch below)
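A minimal sketch of how the two access paths could be compared, assuming the token is available in the BEARER_TOKEN environment variable (per WLCG bearer-token discovery) and standard curl/xrdfs clients; the endpoint and path are placeholders:

#!/usr/bin/env python3
"""Try a token-authenticated listing over HTTPS (WebDAV) and xroot.

Hypothetical sketch: HOST/PATH are placeholders, the token is read from
BEARER_TOKEN, and the xrdfs step relies on the client picking the token up
via standard WLCG bearer-token discovery.
"""
import os
import subprocess

HOST = "xrootd.example.edu:1094"   # placeholder storage endpoint
PATH = "/atlas/rucio/test"         # placeholder path
TOKEN = os.environ.get("BEARER_TOKEN", "")

def try_https() -> int:
    """PROPFIND the directory over HTTPS with an Authorization header."""
    return subprocess.run(
        ["curl", "-sk", "-X", "PROPFIND",
         "-H", f"Authorization: Bearer {TOKEN}",
         "-H", "Depth: 1",
         f"https://{HOST}{PATH}"]).returncode

def try_xroot() -> int:
    """List the directory over the xroot protocol (token via discovery)."""
    return subprocess.run(["xrdfs", f"root://{HOST}", "ls", PATH]).returncode

if __name__ == "__main__":
    print("https:", "ok" if try_https() == 0 else "failed")
    print("xroot:", "ok" if try_xroot() == 0 else "failed")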