US ATLAS Tier 2 Technical

US/Eastern
Alexei Klimentov (Brookhaven National Laboratory (US)), Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
      • Does anyone object to starting future meetings 1 hour earlier?
      • Good running until the token change and a problematic PanDA code change to setterupper.
        • Still recovering from those problems.
        • Need to work to ensure that the US sites are informed of central changes.
          • The US sites did not get tickets notifying us of the token changes while other WLCG sites did.
      • Keep pushing on EL9 and OSG 23.

       

    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric.han-wei Yen (Academia Sinica (TW)), Felix.hung-te Lee (Academia Sinica (TW))
    • 11:20 11:30
      AGLT2 10m
      Speakers: Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Today: site downtime at UM to replace the main breaker in the whole-room UPS.
      This is the third attempt; work is under way.

      EL9: 17 UM worker nodes now on RHEL9

      SOC: Milestone completed.
      The MSU capture node is operational, but the observed throughput is too low; investigating.

      New IAM instance: now configured in dCache.
      The test page was showing problems; Petr updated the test scripts and all tests are now green.
      A minimal sketch of such a token-based check follows.
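      For illustration only, here is a minimal sketch of the kind of token-based dCache check such test scripts perform. The WebDAV door URL, the test path, and the assumption that a WLCG token is already exported in the BEARER_TOKEN environment variable (e.g. by oidc-agent or htgettoken) are placeholders, not the actual AGLT2 or IAM configuration:

```python
# Illustrative only: a token-based smoke test against a dCache WebDAV door.
# The door URL, path, and token source are assumptions, not AGLT2's real setup.
import os

import requests

DOOR = "https://dcache-door.example.edu:2880"                     # hypothetical WebDAV door
PATH = "/pnfs/example.edu/data/atlasscratchdisk/token_probe.txt"  # hypothetical test path
TOKEN = os.environ["BEARER_TOKEN"]                                # e.g. from oidc-agent or htgettoken

headers = {"Authorization": f"Bearer {TOKEN}"}

# Write a small probe file, read it back, then delete it.
put = requests.put(f"{DOOR}{PATH}", data=b"token probe\n", headers=headers)
get = requests.get(f"{DOOR}{PATH}", headers=headers)
delete = requests.delete(f"{DOOR}{PATH}", headers=headers)

print("PUT", put.status_code, "GET", get.status_code, "DELETE", delete.status_code)
ok = put.ok and get.ok and get.content == b"token probe\n" and delete.ok
print("all green" if ok else "check FAILED")
```

      A failing status code on any of the three requests would point at either the door configuration or the token mapping.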

    • 11:30 11:40
      MWT2 10m
      Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US))

      Operations

      • SNMP monitoring updated for the new IU Arista switches (see the polling sketch after this list)
      • A couple of brief outages earlier this week due to hypervisors rebooting
      • Updated our dCache gPlazma and xrootd domains for the new token issuer
      • Partially drained since the 29th due to various upstream issues
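      For reference, a minimal sketch of polling standard IF-MIB interface counters over SNMP with the classic pysnmp hlapi; the switch hostname, community string, and interface index are placeholders rather than the actual MWT2 monitoring configuration:

```python
# Illustrative only: poll 64-bit interface counters from IF-MIB via SNMP v2c
# using the classic pysnmp hlapi. Hostname, community, and ifIndex are placeholders.
from pysnmp.hlapi import (
    CommunityData,
    ContextData,
    ObjectIdentity,
    ObjectType,
    SnmpEngine,
    UdpTransportTarget,
    getCmd,
)

SWITCH = "arista-sw1.example.edu"   # hypothetical switch hostname
COMMUNITY = "public"                # hypothetical read-only community string
IF_INDEX = 1                        # interface index to poll

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData(COMMUNITY, mpModel=1),                      # SNMP v2c
        UdpTransportTarget((SWITCH, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifHCInOctets", IF_INDEX)),
        ObjectType(ObjectIdentity("IF-MIB", "ifHCOutOctets", IF_INDEX)),
    )
)

if error_indication or error_status:
    print("SNMP query failed:", error_indication or error_status.prettyPrint())
else:
    for var_bind in var_binds:
        # Prints e.g. "IF-MIB::ifHCInOctets.1 = 123456789"
        print(" = ".join(x.prettyPrint() for x in var_bind))
```

      In practice such counters would be polled periodically and fed into the existing monitoring rather than printed.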

      EL9 Upgrade Status

      • UIUC datacenter move and RHEL9 upgrade pushed back until January 2025 due to delays in obtaining new switches
      • All UC and IU workers upgraded to AlmaLinux 9
      • ~75% of the UC storage has been upgraded to AlmaLinux 9
      • IU management hypervisors to be upgraded to AlmaLinux 9 next week, followed by upgrading IU to OSG23
      • UC management hypervisors are currently being upgraded to AlmaLinux 9
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Operations highlights

      • 04/20/2024: We were still seeing the CVMFS problems from the week before. We decided to redeploy the CVMFS DaemonSet in the cluster, bringing efficiency back to normal levels, although still somewhat turbulent.
      • 04/27/2024: We redeployed the CVMFS DaemonSet again, using the new CVMFS release image made available on the CERN registry (on 04/22/2024), improving the cluster efficiency further (see the DaemonSet restart sketch after this list).
      • 04/30/2024: We restarted our dCache headnode to fully apply the new token configuration. We are now passing tests with both the old and new token configurations, for both WebDAV and xrootd.
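      For illustration, a minimal sketch of a DaemonSet rolling restart using the official kubernetes Python client, mimicking what "kubectl rollout restart" does by bumping a pod-template annotation; the namespace and DaemonSet name are placeholders, not the actual NET2 object names:

```python
# Illustrative only: trigger a rolling restart of a CVMFS DaemonSet with the
# official kubernetes Python client, the same way "kubectl rollout restart"
# does it. Namespace and DaemonSet name are placeholders, not NET2's objects.
from datetime import datetime, timezone

from kubernetes import client, config

NAMESPACE = "cvmfs"               # hypothetical namespace
DAEMONSET = "cvmfs-nodeplugin"    # hypothetical DaemonSet name

config.load_kube_config()         # use load_incluster_config() when run inside the cluster
apps = client.AppsV1Api()

# Changing a pod-template annotation makes the controller roll every pod,
# so each node picks up the currently configured (new) CVMFS image.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt":
                        datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}

apps.patch_namespaced_daemon_set(name=DAEMONSET, namespace=NAMESPACE, body=patch)
print(f"Rolling restart triggered for {NAMESPACE}/{DAEMONSET}")
```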

       

      Development

      • Working on Prometheus token-less access for OSG.

      Deployment

      • 04/22/2024: 7 new storage servers deployed. NET2 total storage available: 11.4 PB (usable).
      • Our dCache cluster has been based on Alma9 since 08/27/2023.
      • The image used by the pilots has been Alma9-based for all Kubernetes sites since 02/19/2024 (NET2 had been using it intermittently, helping to prepare that image, since November 2023).
      • FY24 compute machines have been racked. Network configuration is ongoing.
    • 11:50 12:00
      SWT2 10m
      Speakers: Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      CPB: 

      • Trying to understand occasional HC off-lining events, which seem to be related to storage direct I/O.

      • Testing a fourth DTN - expect to bring it online soon.

      • CVMFS pilot wrapper failures ("faults") have decreased.

      • Added the SWT2_GOOGLE_ARM PanDA queue - not much running in the HIMEM queue recently.

      • Meeting with campus networking folks this week to finalize the WLCG monitoring deployment.

      • A student is working on the LOCALGROUPDISK "atime" project (a minimal scan sketch follows this section).

      • We have a few servers set up so far for testing. The machine that will become our frontend is currently running AlmaLinux 9 with Puppet and Foreman installed and running; we are working on understanding and testing these tools.
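      As a starting point, a minimal sketch of the kind of "atime" scan the LOCALGROUPDISK project could use: walk a storage mount and summarize files not accessed in the last N days. The mount point and threshold are placeholders, and the scan assumes the filesystem records access times (i.e. is not mounted with noatime):

```python
# Illustrative only: summarize LOCALGROUPDISK files not accessed in the last
# N days using their atime. Mount point and threshold are placeholders, and
# the scan assumes the filesystem is not mounted with noatime.
import os
import time

ROOT = "/xrd/localgroupdisk"   # hypothetical storage mount point
DAYS = 365                     # report files not read within the last year
cutoff = time.time() - DAYS * 86400

stale_files = 0
stale_bytes = 0
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            st = os.stat(path)
        except OSError:
            continue  # file disappeared or is unreadable; skip it
        if st.st_atime < cutoff:
            stale_files += 1
            stale_bytes += st.st_size

print(f"{stale_files} files totalling {stale_bytes / 1e12:.2f} TB "
      f"have not been accessed in the last {DAYS} days under {ROOT}")
```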

       

      OU:

      • Running well, nothing to report.
      • Still need to update CRIC so that OU's Slate_SQUID is used: https://atlas-cric.cern.ch/core/experimentsite/detail/OU_OSCER_ATLAS/

       

      Google:

      • Phase 1 was very successful: all stuck very-high-memory and express-queue tasks were completely mopped up within a few weeks. No tasks are left to do, but both queues will remain on in case new work shows up.
      • For Phase 2, Mario and others in the Rucio team are helping to set up an SE at Google. Right now the Google queues use the SWT2_CPB SE.
      • For Phase 3, Fernando set up an ARM queue, which quickly ramped up to 5k cores. It looks more cost-effective than Intel; this will be studied. However, HC suddenly blacklisted and shut down the queue. Kaushik will follow up with an email on lessons learned.