US ATLAS Tier 2 Technical
- The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.
- Announcements will be sent to the usatlas-t2-l@lists.bnl.gov mailing list.
11:00 → 11:10
Top of the meeting discussion (10m)
Speakers: Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
- Reasonable running over the past month
- AGLT2: one draining, but the reason was not recorded.
- MWT2: ran into trouble with the dCache upgrade.
- NET2: had an extended downtime for networking issues.
- SWT2: various minor storage and networking issues.
- I am benchmarking Genoa and Bergamo CPUs (HEPScore23 on RHEL9). Let me know if you want a particular model tested.
- Also, even with a lot of hints from Lincoln, I am having trouble getting a container to run HEPSpec06 for comparison. If someone knows how, please send me instructions (one untested idea is sketched after this list).
- Please get your reporting in now if you have not.
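On the HS06-in-a-container question, one possible approach is sketched below. This is an untested illustration, not something verified at any of our sites: it assumes a locally installed, licensed SPEC CPU 2006 / HS06 tree and runs it inside a stock CentOS 7 image (HS06 predates the EL9 toolchains); the bind path, image tag, and SPEC config file name are placeholders, and the runspec flags should be checked against the HEPiX HS06 instructions.

    # Untested sketch: run the licensed HS06 / SPEC CPU 2006 tree inside a CentOS 7 container.
    # /scratch/spec2006 and linux64-gcc_cern.cfg are placeholders for the local installation.
    apptainer exec --bind /scratch/spec2006:/spec docker://centos:7 \
        /bin/bash -c 'cd /spec && source shrc && runspec --config=linux64-gcc_cern.cfg --rate=$(nproc) all_cpp'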
11:10 → 11:20
AGLT2 (10m)
Speakers: Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
6-Jan: Planned UM building power outage at 9am; the generator was supposed to carry the room.
Unexpected and unexplained overheating (related to the cooling water also being shut down),
compounded by loss of access to the room (backup card-access power failed).
We manually did an emergency shutdown of the storage servers. Cooling recovered by 3pm.
3 of the storage servers had issues with BOSS storage on reboot (probably from solid-state overheating) but RAID1 eventually recovered.
A meeting is scheduled with the facility team and LSA IT.
11-Jan: Planned site shutdown at 10am for replacement of the APC UPS breaker at UM. The wrong part was shipped.
APC maintenance work was done, but another shutdown will be needed to replace the breaker.
Power restored at 1pm. It took 4h to get back online, including 2 storage nodes with OS file system issues that had to be rebuilt.
Recovered by 5pm. Declared the downtime ended at 6pm, but didn't get jobs until 16h later due to a Switcher proxy issue.
18-Jan: Ticket 164948 about file transfer failures. One of the 2 storage nodes could not access dCache metadata.
This was a missing config step after the rebuild from 11-Jan.
Also: now testing an almost-functional Firefly (IPv6 flow marking) setup for dCache.
Purchase plan: https://docs.google.com/document/d/11wU_scbpFQInz8qkXo4LPnnTH_Jn3soos7IxF5lSdzU/edit?pli=1
Currently working on quotes with Dell.
11:20 → 11:30
MWT2 (10m)
Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US))
dCache upgrade
- We are currently running 9.2.9, with 9.2.10 on our WebDAV doors.
- Token support enabled. Will respond to the GGUS ticket.
- Ran into issues with the upgrade (see the Mattermost discussion). They seemed to clear up after we applied the PostgreSQL tunings from Fermilab, but it's still unclear which settings fixed it (an illustrative sketch of the kinds of settings involved follows this list).
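For reference, a sketch of the kinds of postgresql.conf settings that dCache database tuning usually touches is below. These are example values only, not the actual Fermilab numbers, and would need to be sized to the DB host.

    # postgresql.conf fragment (illustrative values only, not the Fermilab tunings)
    shared_buffers = 8GB            # often ~25% of RAM on a dedicated DB host
    effective_cache_size = 24GB     # rough estimate of memory available for caching
    work_mem = 64MB                 # per-sort/hash memory; helps large namespace queries
    maintenance_work_mem = 2GB      # speeds up vacuum/index maintenance
    random_page_cost = 1.1          # lower value appropriate for SSD-backed storage
    max_connections = 500           # dCache doors/pools hold many connections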
Procurement plan
- Compute at IU and UIUC
- Storage at UC
- ARM also at UC (won't be Dell)
11:30 → 11:40
NET2 (10m)
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Pilot Zombie
- After running workers, a bug in the new pilot wrapper release prevented pilots from being terminated correctly, leading to a state where resources were still claimed but idle.
- Proper testing before rolling the version into production could have prevented it; at least 2 production sites had production affected. NET2 had its actual number of occupied CPUs reduced to 1/5 for several days.
- The problem seems to be well understood and fixed now. We are still testing.
- The problem was not directly related to the use of Kubernetes.
Storage
- Token support configured and tested. Waiting for confirmation from Petr.
- dCache upgrade to the same version used at BNL, 9.2.6, shows good performance.
- A load test was performed yesterday using files from BNL, saturating the link without showing a significant load in the dCache head node.
- Version 9.2.10 was tested and showed the 10s+ delay on listing files, even with enable_seqscan=off set in the database. We are back to the stock configuration for this variable (a sketch of toggling it is below).
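For context, toggling that planner setting cluster-wide uses standard PostgreSQL commands, sketched below; whether it was applied per session, per database, or cluster-wide in our tests is not recorded here.

    # Standard PostgreSQL commands (sketch); run as the postgres superuser on the DB host.
    psql -c "ALTER SYSTEM SET enable_seqscan = off;" -c "SELECT pg_reload_conf();"
    # Back to the stock configuration:
    psql -c "ALTER SYSTEM RESET enable_seqscan;" -c "SELECT pg_reload_conf();"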
Cluster
- OKD upgraded to 4.14.
- IPv6 is configured in the cluster. We just started testing it.
- A small batch of Dell servers, model C6420, has problems being added to the cluster. The initial inspection is OK, but there is a problem when the bare-metal host is claimed by a node: provisioning reports success, but the host never reaches a proper phase and never becomes the designated node. We have 4 different models in the farm; this is the first one to present such a problem. Troubleshooting will resume now (an illustrative inspection sketch is below).
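For anyone following along, the usual way to see where such a host is stuck in OKD's bare-metal provisioning is sketched below. It assumes the standard Metal3 BareMetalHost objects in the openshift-machine-api namespace; the host name is a placeholder, not the actual node name.

    # Sketch: check the provisioning state of the stuck C6420 hosts (host name is a placeholder)
    oc get baremetalhosts -n openshift-machine-api
    oc describe baremetalhost c6420-worker-01 -n openshift-machine-api
    # See whether a Machine/Node ever got associated with the host
    oc get machines -n openshift-machine-api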
Perfsonar:
- New server with perfSONAR v5 deployed.
- Debugging registration of the throughput and latency measurement interfaces.
Racking of new storage
- Ongoing.
Procurements
- We got two preliminary quotes from Cambridge/Dell for dual 9534 and dual 9754
- Performed benchmarking of the configurations: dual 9534: 4997 HS23 (256T); dual 9754: 8477 HS23 (512T).
- The 9754 is more efficient and less expensive, but we are concerned about power dissipation (see the rough per-thread numbers after this list).
- Working with Cambridge and Dell on a decision.
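For rough scale, using only the numbers quoted above: the dual 9754 node delivers about 8477/4997 ≈ 1.7x the HS23 of the dual 9534 node, while per hardware thread it is 8477/512 ≈ 16.6 HS23 versus 4997/256 ≈ 19.5 HS23; the efficiency advantage noted above presumably refers to HS23 per dollar and per watt, which are not quoted here.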
11:40 → 11:50
SWT2 (10m)
Speakers: Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Booth (University of Texas at Arlington (US))
OU:
- Smooth running, other than the occasional storage transfer overload causing a few timeouts.
- Added OU_OSCER_ATLAS_SLATE_SQUID to OSG Topology. Waiting for that to be migrated to CRIC; then we will reconfigure CRIC to start using it.
- Purchase plan: Replacing the current DTN se1.oscer.ou.edu with new hardware this month, which will then connect to the new Ceph storage later this summer when we replace our aging T630 XRootD storage.
SWT2_CPB:
- As of 1/19/24 a problem with WAN connectivity to LHCONE sites was resolved. In the course of troubleshooting we met with campus networking personnel and concluded that the issue was most likely upstream of the campus edge router. Dale Carder from ESnet was very helpful. He worked with our regional network provider (LEARN) to identify the source of the problem. (LEARN provides the link that carries our traffic from the campus to the location in Dallas where we peer with ESnet / LHCONE.) On Friday 1/19, LEARN found a problem with the path we were taking and re-routed the traffic. The improvement was immediately obvious:
https://my.es.net/lhcone/view/UTA?t=7d
GGUS tickets 164790 & 164901 were closed.
- Our procurement plans for FY24 include: (i) replacing ~3 PB of storage (mostly our oldest hardware, Dell MD3460's); (ii) replacing some old network switches in the cluster; (iii) purchasing a small number of compute nodes, as a start toward gradually retiring at least some fraction of our oldest WNs (Dell R410's).
- The refresh of the UPS in the data center, scheduled for 1/24/24, was unfortunately pushed back to 2/21/24 by Schneider / APC. In the interim we may go ahead and replace a failed power module in the unit. We are discussing this option with our local vendor rep.