US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
-
10:00
→
10:10
Introduction 10m
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
-
10:10
→
10:20
TW-FTT 10m
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
-
10:20
→
10:30
AGLT2 10m
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- EL9 at MSU: status after Satellite and Capsule upgrade to v6.15:
Remaining "known" problem with the host OS being switched after a build; a workaround is in place.
(Possibly a regression: the bug tracker claims this was resolved for "Red Hat Satellite 6.8 for RHEL 7",
while we seem to see exactly that behavior on "Red Hat Satellite 6.15 for RHEL 8".)
More importantly: no new instances of the obscure, hidden, and fatal record corruption.
- Currently focusing on deploying purchased equipment: storage and compute.
- dCache issues: updating the head nodes to the 10.2.13-scitag version caused Java memory-leak issues. Rolled back to 10.2.12; after rebooting 5 times, the memory leak eventually stopped (a monitoring sketch is included at the end of this report).
- Recently learned that there will be a disruption of street water at UM that will affect cooling.
Currently planned to be overnight Thursday into Friday.
Will need a full downtime.
Trying to understand whether we can keep infrastructure up (power, networking, fan doors, VMware).
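For reference, a minimal sketch (not our actual tooling) of the kind of check used to confirm the memory leak had stopped: it polls the resident set size of the dCache Java processes and prints the growth between samples. The process-name match, the sampling interval, and running it on the head nodes are all assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: watch the RSS of dCache Java processes for leak-like
growth. Assumes Linux /proc and that dCache domains run as 'java' processes
whose command line mentions 'dcache'; not AGLT2's actual tooling."""
import time
from pathlib import Path

def dcache_java_pids():
    """Return PIDs of java processes whose command line mentions dcache."""
    pids = []
    for proc in Path("/proc").iterdir():
        if not proc.name.isdigit():
            continue
        try:
            cmdline = (proc / "cmdline").read_bytes().replace(b"\0", b" ")
        except OSError:
            continue  # process exited while scanning
        if b"java" in cmdline and b"dcache" in cmdline.lower():
            pids.append(int(proc.name))
    return pids

def rss_kib(pid):
    """Resident set size in KiB from /proc/<pid>/status (VmRSS line)."""
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0

if __name__ == "__main__":
    previous = {}
    while True:
        for pid in dcache_java_pids():
            try:
                rss = rss_kib(pid)
            except OSError:
                continue  # process exited between scan and read
            delta = rss - previous.get(pid, rss)
            print(f"pid={pid} rss={rss} KiB delta={delta:+d} KiB")
            previous[pid] = rss
        time.sleep(300)  # sample every 5 minutes; adjust as needed
```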
-
10:30
→
10:40
MWT2 10m
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Data disk briefly filled on 06/15/2025 (a threshold-check sketch is included at the end of this report).
- IU networking made a change to the IU network on June 17th that took us offline. Changes were reverted on June 18th.
- BGP tagging ticket was closed on 06/23.
- dCache upgrade to 10.2.13-1. Had our downtime on June 24, extended to 12pm on June 25th due to a storage node having hardware problems.
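For illustration only, a minimal threshold check of the kind that could be run from cron against the data partition; the mount point and the 90% threshold are assumptions, not MWT2's actual monitoring.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: warn when a data partition passes a fill threshold.
The mount point and threshold below are placeholders, not real site values."""
import shutil
import sys

DATA_MOUNT = "/data"   # assumed mount point of the data partition
THRESHOLD = 0.90       # alert when more than 90% full

def main():
    usage = shutil.disk_usage(DATA_MOUNT)
    fraction = usage.used / usage.total
    print(f"{DATA_MOUNT}: {fraction:.1%} full "
          f"({usage.used / 2**40:.1f} of {usage.total / 2**40:.1f} TiB)")
    # Non-zero exit lets a cron/Nagios-style wrapper treat this as an alert.
    return 1 if fraction > THRESHOLD else 0

if __name__ == "__main__":
    sys.exit(main())
```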
-
10:40
→
10:50
NET2 10m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Number of slots was reduced in the first week for an electrical performance measurement campaign.
Operations were very smooth this week.
We've been observing several jobs with low CPU consumption (a sketch of the efficiency check is included at the end of this report). Here is the summary copied from Ivan's news of the day:
- Derivation tasks reported by NET2 using only one core. (TaskID:45275683)
- 7 of 8 procs seem to get stuck, or take a long time, so they process ~100 events while 1 process has done 20k.
- Effectively serial. Manual serial run does not hang. SharedWriter issue?
- Some jobs are looping.
- Updated Jira and asked to pause.
No news on updates yet.
Closed ticket 3692 since transfers are no longer failing.
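To make the low-CPU symptom concrete, a small hypothetical sketch of the efficiency check: an 8-core job whose processes are effectively serial shows a CPU efficiency near 1/8. The job records and field names below are illustrative assumptions, not actual PanDA/pilot fields.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: flag multicore jobs whose CPU efficiency suggests they
are effectively serial (roughly one busy process out of N requested cores).
Job records and field names are illustrative only."""

def cpu_efficiency(cpu_time_s: float, wall_time_s: float, cores: int) -> float:
    """CPU efficiency = cputime / (walltime * cores); 1.0 means fully busy."""
    if wall_time_s <= 0 or cores <= 0:
        return 0.0
    return cpu_time_s / (wall_time_s * cores)

def looks_serial(cpu_time_s: float, wall_time_s: float, cores: int) -> bool:
    """A multicore job using roughly one busy core looks 'effectively serial'."""
    return cores > 1 and cpu_efficiency(cpu_time_s, wall_time_s, cores) <= 1.5 / cores

if __name__ == "__main__":
    # Illustrative numbers: an 8-core derivation job where ~1 core did the work.
    jobs = [
        {"id": "A", "cores": 8, "wall_time_s": 36000, "cpu_time_s": 40000},
        {"id": "B", "cores": 8, "wall_time_s": 36000, "cpu_time_s": 270000},
    ]
    for job in jobs:
        eff = cpu_efficiency(job["cpu_time_s"], job["wall_time_s"], job["cores"])
        flag = "SUSPECT (effectively serial?)" if looks_serial(
            job["cpu_time_s"], job["wall_time_s"], job["cores"]) else "ok"
        print(f"job {job['id']}: efficiency {eff:.2f} -> {flag}")
```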
-
10:50
→
11:00
SWT2 10m
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- For EL9 and storage: fixed the CRIC setting for the test cluster, cleared old jobs on the test CE, and are now waiting for new jobs to complete on the worker nodes so we can test the new storage. We are testing the EL9 storage module in the test cluster before deploying new hardware, then will test the other EL9 modules.
- Finished testing partition layouts. We decided on two: one that overwrites all partitions (for empty storage systems only) and one that preserves the data partitions, allowing us to rebuild the OS of a storage system while keeping its data.
- Changed plans for Zabbix: it will be used strictly for alerting rather than for both alerting and displaying monitoring information.
- GGUS tickets - Network Monitoring:
  - Continued to follow up with campus networking. They have partially granted us access to a Grafana plot of CPB throughput from their end. After months of communication and follow-ups, campus networking contacted campus network security about SNMP read-only access to the appropriate switch port, and it has been approved. They still need to inspect our web server to assess its security; I have requested they do this as soon as they are able. Follow-up questions are being asked as needed to ensure we do not infringe on their security policies.
  - Moving the configuration from the EL7 monitoring server to a new EL9 server that will be used for this.
  - Waiting for campus networking to implement the SNMP change (a sketch of the intended polling is included at the end of this report).
  - Communicating with campus network security about other aspects of their security policy so we can implement this without infringing on it.
  - Waiting for this to be completed before working toward BGP tagging.
- Communicating with Dell sales representatives to extend the warranty on thirteen of our storage systems: eleven R740s and two ME4084s. Negotiations have finished and we are following through on their most recent quote.
- Communicated with a Dell sales representative about new hardware for the head nodes. We are finalizing a purchase of R450s with a configuration suitable for replacing our XRootD proxies and master node. They will also have higher network capability for when we improve our network infrastructure.
- Using a new server in the test cluster to test Varnish before implementing it; communicating with Ilija.
- Continuing to test changes to the parameters of both of our CEs. We have been communicating with OSG experts and the Harvester team for deeper understanding and when discovering new bugs. Examples:
  - The condor-ce reconfig command not working as expected when changing the max job limit.
  - Reducing the max job parameter to 0 on one CE seemed to cause the whole site to drain and stop receiving new jobs. We discussed better alternatives for draining a CE using CRIC instead.
  - A gradual increase in jobs being cancelled from Harvester's perspective. This seems to have two causes: we tried to add IPv6 to our CE, and we had ten problematic worker nodes in a strange state involving the Puppet agent.
  - We reverted the IPv6 change and restarted the condor-ce services for the revert to take effect, after which the job-cancellation rate dropped significantly. We will test IPv6 in the test cluster first.
  - We restarted the Puppet agent on the problematic worker nodes, which resolved the strange state in which the agent timed out while fetching the catalog from the Puppet master. We are still investigating this.
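For context, a minimal sketch of the kind of polling the read-only SNMP access would enable, assuming pysnmp; the switch hostname, community string, and interface index are placeholders, not real values.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: poll a switch port's traffic counters over read-only
SNMP, along the lines of what the campus-networking request would allow.
Assumes pysnmp (v4 hlapi); hostname, community, and ifIndex are placeholders."""
import time
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

SWITCH = "switch.example.edu"   # placeholder hostname
COMMUNITY = "public"            # placeholder read-only community string
IF_INDEX = 1                    # placeholder interface index for the uplink

def if_octets(direction: str) -> int:
    """Read ifHCInOctets/ifHCOutOctets for the configured interface."""
    oid = ObjectIdentity("IF-MIB", f"ifHC{direction}Octets", IF_INDEX)
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(COMMUNITY, mpModel=1),   # SNMP v2c, read-only
        UdpTransportTarget((SWITCH, 161)),
        ContextData(),
        ObjectType(oid)))
    if error_indication or error_status:
        raise RuntimeError(f"SNMP query failed: {error_indication or error_status}")
    return int(var_binds[0][1])

if __name__ == "__main__":
    interval = 60  # seconds between samples
    last_in, last_out = if_octets("In"), if_octets("Out")
    while True:
        time.sleep(interval)
        now_in, now_out = if_octets("In"), if_octets("Out")
        # Convert byte deltas to average Gbit/s over the interval.
        print(f"in: {(now_in - last_in) * 8 / interval / 1e9:.2f} Gbit/s, "
              f"out: {(now_out - last_out) * 8 / interval / 1e9:.2f} Gbit/s")
        last_in, last_out = now_in, now_out
```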
OU:
- Sorry, can't attend today
- Nothing to report, site running well