US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 - 11:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
      • DC24 has affected all sites detrimentally.
        • Two FTS outages in the last week.
        • Please discuss how DC24 has affected your site.
      • Could each site please discuss its plans and status for the EL9 migration.
      • CVMFS troubles?
    • 11:10 - 11:20
      AGLT2 10m
      Speakers: Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      No major problems.
      No problems with DC24 (so far).

      Starting 01/29/2024 we noticed a sharp increase in CVMFS failures on worker nodes.
      For a large fraction of the incidents, ‘cvmfs_config probe’ would hang while probing the OSG repositories.
      It often took one or two CVMFS restart (killall) attempts to recover a node.
      On several occasions a reboot was needed to recover. This was most pronounced in the first week
      and gradually quieted down over the next two weeks. No issues have been noticed recently.
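
      A minimal watchdog sketch of that recovery sequence, assuming the standard cvmfs_config and killall commands are available on the node; the repository list and timeout are illustrative, not what AGLT2 actually runs.

      #!/usr/bin/env python3
      # Sketch: probe CVMFS and, if the probe hangs or fails, try the same
      # killall-based restart that usually recovered our worker nodes.
      import subprocess

      REPOS = ["atlas.cern.ch", "oasis.opensciencegrid.org"]  # illustrative list
      PROBE_TIMEOUT = 120  # seconds; a hung probe was the usual symptom

      def probe_ok(repo):
          try:
              r = subprocess.run(["cvmfs_config", "probe", repo], timeout=PROBE_TIMEOUT,
                                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
              return r.returncode == 0
          except subprocess.TimeoutExpired:
              return False

      def restart_cvmfs():
          # "killall" restart: kill the cvmfs2 client processes; autofs-managed
          # mounts come back on next access. A second attempt (or a reboot)
          # was sometimes needed.
          subprocess.run(["killall", "cvmfs2"])

      if __name__ == "__main__":
          for repo in REPOS:
              if not probe_ok(repo):
                  print(f"probe of {repo} hung or failed; restarting CVMFS")
                  restart_cvmfs()
                  break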

      Two tickets about SLATE squid instance problems.
      The first ticket was for the sl-um-es2 instance hanging.
      The second ticket followed an update of the container for a security fix that may not have been successfully deployed.
      It appears the wrong image tag (testing) was used for sl-um-es3?

      dCache xrootd monitoring: configured, but no reports going to Kafka yet.
      Waiting until the end of DC24 to restart the doors.
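
      A minimal check, once the doors are restarted, that the xrootd transfer reports are actually reaching Kafka; the broker address and topic name are placeholders (not AGLT2 values) and would need to match the local dCache/Kafka configuration. Requires the kafka-python package.

      #!/usr/bin/env python3
      # Sketch: consume a few messages from the (assumed) dCache reporting topic
      # to confirm that records are flowing after the doors are restarted.
      from kafka import KafkaConsumer

      BROKERS = ["kafka.example.org:9092"]   # placeholder broker address
      TOPIC = "billing"                      # placeholder topic name

      consumer = KafkaConsumer(TOPIC,
                               bootstrap_servers=BROKERS,
                               auto_offset_reset="latest",
                               consumer_timeout_ms=60000)  # give up after 60 s of silence

      count = 0
      for msg in consumer:
          count += 1
          print(msg.value[:200])  # show the start of each record
          if count >= 5:
              break

      if count:
          print(f"received {count} record(s)")
      else:
          print("no records received; check the door and Kafka settings")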

      We usually pool our MSUT2, UMT2, and UMT3 purchases to maximize the discount,
      but the UM T3 has to spend its DOE money now.
      Compute selected: R6625 with 32-core AMD EPYC 9354 CPUs and 24x16 GB DIMMs (128 hardware threads, 3 GB/HT).
      Storage selected: R760xd2 with 24x20 TB drives, an Intel Xeon Silver 4510 CPU (2.4 GHz, 12 cores), and 8x16 GB DIMMs.
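
      For reference, the arithmetic behind the figures in parentheses, assuming a dual-socket R6625 (the 128 hardware threads and 24 DIMMs only work out for two 32-core CPUs):

      # Compute node (R6625) sizing, assuming 2 sockets of 32-core EPYC 9354 with SMT enabled
      sockets, cores_per_socket, threads_per_core = 2, 32, 2
      hw_threads = sockets * cores_per_socket * threads_per_core   # 128 hardware threads
      memory_gb = 24 * 16                                          # 384 GB total
      print(hw_threads, memory_gb / hw_threads)                    # 128, 3.0 GB per hardware thread

      # Storage node (R760xd2) raw capacity
      print(24 * 20, "TB raw per node")                            # 480 TB raw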

      EL9: both UM and MSU have Red Hat site licenses.
      UM has been able to provision RHEL9 nodes from Red Hat Satellite; MSU has not yet.
      Lots of work ahead.

    • 11:20 - 11:30
      MWT2 10m
      Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US))

      DC24 effects

      • We are seeing I/O overload on our MD3460 storage nodes during the data challenge
      • Pools have been reduced to 100 movers per pool

       

      Rebuilding existing UC and IU workers and storage as AlmaLinux 9

      • A missing openldap-compat package on one of the rebuilt EL9 workers was causing job errors (a quick check is sketched below)
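
      A quick verification sketch, run on a rebuilt worker, to catch this kind of missing package before jobs fail; only the package named above is listed.

      #!/usr/bin/env python3
      # Sketch: verify that packages jobs depend on are installed on a rebuilt EL9 worker.
      import subprocess

      PACKAGES = ["openldap-compat"]  # the package that was missing; extend as needed

      missing = [p for p in PACKAGES
                 if subprocess.run(["rpm", "-q", p],
                                   stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL).returncode != 0]

      if missing:
          print("missing packages:", ", ".join(missing))
      else:
          print("all required packages installed")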

       

      CVMFS Varnish issues starting January 26

      • Removed our Varnish caches from the CVMFS proxy configuration on all workers (a verification sketch is below)
      • Root cause still to be understood
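
      A small verification sketch for that proxy change, assuming the standard cvmfs_config showconfig command; the repository name and the "varnish" substring used to spot the old proxies are placeholders, not the actual MWT2 names.

      #!/usr/bin/env python3
      # Sketch: confirm that the Varnish caches no longer appear in the effective
      # CVMFS proxy setting on a worker node.
      import subprocess

      REPO = "atlas.cern.ch"       # repository whose effective config we inspect
      VARNISH_MARKER = "varnish"   # placeholder substring matching the old proxy hostnames

      out = subprocess.run(["cvmfs_config", "showconfig", REPO],
                           capture_output=True, text=True).stdout

      for line in out.splitlines():
          if line.startswith("CVMFS_HTTP_PROXY"):
              print(line)
              if VARNISH_MARKER in line.lower():
                  print("WARNING: a Varnish host is still in the proxy list")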

       

      Setting up Kafka and the wlcgConverter for our dCache

      IU Brocade to be replaced in the coming weeks to a month

    • 11:30 - 11:40
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
    • 11:40 - 11:50
      SWT2 10m
      Speakers: Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • The site has not been full for a couple of days. We suspect: 1) lots of single-core production; 2) a file-transfer backlog (DC24, FTS, Google traffic?).
      • Still trying to get the UPS upgrade work performed...
      • Debugging hardware problems on a few compute nodes - this can be tedious.
      • Have set up both AlmaLinux 9 and Rocky Linux 9 instances. No strong preference here. The full cluster migration will come later (but in time).
      • Prior to the past couple of days, mostly smooth running.

       

      OU:

      • Mostly running well over the last few weeks.
      • SLATE Squid is ready for production, but there is a network issue between the CERN squid monitor and the OU SLATE node; working on that.
      • DC24 overloaded some xrootd storage servers; we have to periodically restart the xrootd daemons on those (a watchdog sketch is below).
      • The CE is already on EL9; we will upgrade OSG from 3.6 to 23. The SLATE Squid is also already on EL9. The new SE will be installed with EL9 and OSG 23 when we receive it next month. OSCER compute nodes will be upgraded to EL9 later this spring.
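
      A minimal watchdog sketch along the lines of those periodic restarts, assuming xrootd listens on the default port 1094 and runs under a systemd unit; the unit name xrootd@clustered is an assumption and should be replaced with the local instance name.

      #!/usr/bin/env python3
      # Sketch: restart xrootd when it stops answering on its service port.
      import socket
      import subprocess

      HOST, PORT = "localhost", 1094     # default xrootd port
      UNIT = "xrootd@clustered"          # assumed unit name; replace with the local one

      def responsive(host, port, timeout=10):
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return True
          except OSError:
              return False

      if not responsive(HOST, PORT):
          print("xrootd not answering; restarting", UNIT)
          subprocess.run(["systemctl", "restart", UNIT])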