
US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 10:00 10:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
      • Received quotes from Dell at both IU and UM.
        • View the quotes at: https://drive.google.com/drive/folders/1LqlkGpt_jr-HpVQkylPOhxNbOze63DYe
        • The bottom line (the per-unit cost arithmetic is sketched after this list):
          • R760xd2 storage with 20 TB disks: UM $47.75/TB, IU $48.35/TB. (Disk prices rose between the UM and IU quotes.)
          • R6625 with 2 x AMD 9354, 64C/128T per server: UM $4.31/HS (2 PSUs and more memory) / IU $4.17/HS (1 PSU and less memory).
            • The 4th generation (Genoa) and 5th generation (Bergamo) AMD processors have 12 memory channels instead of the previous 8, which means RDIMMs must be bought in multiples of 12 to get full memory performance.
          • C6525 with 2 x AMD 7443 CPUs (3rd generation): $4.25/HS. (The newer AMD 4th and 5th generation processors run too hot for this type of server.)
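
      As an illustration of how the per-unit figures above are derived, here is a minimal sketch of the arithmetic (the server prices, drive counts, and HEPScore values below are hypothetical placeholders, not numbers from the Dell quotes):

        # Hypothetical numbers, purely to show the method of comparing quotes.
        usable_tb_per_server = 24 * 20      # e.g. 24 x 20 TB drives in one storage server
        storage_price = 23000.0             # placeholder price per storage server (USD)
        print(f"${storage_price / usable_tb_per_server:.2f}/TB")

        hs_per_server = 3000.0              # placeholder HEPScore for a dual-socket compute node
        compute_price = 12900.0             # placeholder price per compute server (USD)
        print(f"${compute_price / hs_per_server:.2f}/HS")

        # Memory population: 12 channels per socket on Genoa-class CPUs means a
        # dual-socket node needs DIMMs in multiples of 12 per socket (24 total)
        # to keep every channel populated.
        print(f"DIMMs to populate all channels on a 2-socket node: {2 * 12}")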


    • 10:10 10:20
      TW-FTT 10m
      Speakers: Felix.hung-te Lee (Academia Sinica (TW)), Han-Wei Yen
    • 10:20 10:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      GGUS ticket 167854: 100% transfer failures between AGLT2 and PIC. We investigated: 5% of the failed files were genuinely lost and we declared them lost in Rucio; the other 95% of the files do not exist in Rucio.
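
      A minimal sketch of declaring the lost files, assuming the Rucio Python client's ReplicaClient.declare_bad_file_replicas() call (the exact call and the PFNs below are placeholders and may differ with the deployed Rucio version and site naming):

        from rucio.client.replicaclient import ReplicaClient

        client = ReplicaClient()
        # Placeholder PFNs for files confirmed lost at the site.
        lost_pfns = [
            "davs://dcache-door.example.aglt2.org/pnfs/path/to/lost_file_1",
        ]
        # Declaring the replicas bad lets Rucio re-replicate from other copies,
        # or mark the files lost if no other replica exists.
        client.declare_bad_file_replicas(lost_pfns, reason="Data loss at AGLT2 (GGUS 167854)")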

      Continuing to tune HTCondor to increase cluster occupancy; the goal is 99.5%.
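
      A quick occupancy check along these lines, assuming a standard HTCondor pool where condor_status reports the State and Cpus slot attributes:

        import subprocess

        # One "<State> <Cpus>" line per slot from the collector.
        out = subprocess.run(
            ["condor_status", "-af", "State", "Cpus"],
            capture_output=True, text=True, check=True,
        ).stdout

        claimed = total = 0
        for line in out.splitlines():
            state, cpus = line.split()
            total += int(cpus)
            if state == "Claimed":
                claimed += int(cpus)

        # Occupancy as the fraction of CPU cores in claimed slots.
        print(f"CPU occupancy: {100.0 * claimed / total:.2f}% (target 99.5%)")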

      Procurement: both MSU and UM placed orders on 8/16 and expect to receive the hardware in a couple of weeks.

      EL9: MSU is working on building the capsule server to talk to the satellite server. 


    • 10:30 10:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • Mostly running smoothly
        • Drained on Aug 9 due to the mass blacklisting event
      • Working on finalizing our purchases
        • UC: storage to retire the MD3460s (6-7 PB)
        • IU: storage (~5 PB), ARM compute?
        • UIUC: compute only
    • 10:40 10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      CVMFS:

      • Currently using an older version, 2.11.0-1.
      • The site has remained stable with this version so far.
      • Using a large cache of 100 GB but considering reducing it to 50 GB (see the cache-size sketch below).
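
      A minimal sketch of the cache-size change, assuming the standard CVMFS client parameter CVMFS_QUOTA_LIMIT (in MB) in /etc/cvmfs/default.local:

        # Illustrative only: print the config fragment for a 50 GB client cache.
        quota_mb = 50 * 1024   # CVMFS_QUOTA_LIMIT is expressed in MB
        print(f"CVMFS_QUOTA_LIMIT={quota_mb}")
        # The fragment goes in /etc/cvmfs/default.local; apply it with `cvmfs_config reload`.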

      FY24 Procurement Plans: under discussion.



    • 10:50 11:00
      SWT2 10m
      Speakers: Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • Alma9 -
        • Running test jobs on compute nodes against the old Slurm server is problematic: Slurm components are only compatible with other versions within two major releases (a version-check sketch follows this list).
        • Continuing to make adjustments and improvements to the compute node appliance.
        • Working on completing and testing the Slurm master node appliance in the test Puppet cluster before deploying it in the SWT2 cluster.
        • Planning to run additional tests once the Slurm appliance is created before deciding on a deployment method.
          • "Rolling / live" or take a downtime?

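        A quick way to see how far node slurmd versions lag the controller, assuming the SLURM_VERSION field in `scontrol show config` and the Version= field in `scontrol show node` (field names may vary by Slurm release):

          import re
          import subprocess

          def scontrol(*args):
              return subprocess.run(["scontrol", *args], capture_output=True, text=True, check=True).stdout

          # slurmctld version from the configuration dump.
          ctld = re.search(r"SLURM_VERSION\s*=\s*(\S+)", scontrol("show", "config")).group(1)

          # Distinct slurmd versions reported in the node records.
          node_versions = sorted(set(re.findall(r"Version=(\S+)", scontrol("show", "node"))))

          # Slurm only supports mixing components within two major releases, so any
          # node version further behind the controller than that is a problem.
          print(f"slurmctld: {ctld}")
          print(f"slurmd versions on nodes: {', '.join(node_versions) or 'none reported'}")
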

      • slurm 'Kill Task Failed' WN exclusions -
        • Rate of exclusions has improved significantly since last Friday.
        • Error rates correlated with 'Kill task failed' and the number of nodes removed from Slurm for this reason have become minimal (only a few nodes over the past weekend); a counting sketch follows this list.
        • We have been in contact with others concerning job issues, which may now be resolved.
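
        A small way to count these exclusions, assuming sinfo's -R reason listing and its %N/%E format fields (format codes may vary by Slurm version):

          import subprocess
          from collections import Counter

          # One "<node> <reason>" line per drained/down node.
          out = subprocess.run(
              ["sinfo", "-R", "-h", "-N", "-o", "%N %E"],
              capture_output=True, text=True, check=True,
          ).stdout

          reasons = Counter()
          for line in out.splitlines():
              _node, _, reason = line.partition(" ")
              reasons[reason.strip()] += 1

          # Nodes drained with a kill-task-failure reason.
          kill_task = sum(n for r, n in reasons.items() if "kill task failed" in r.lower())
          print(f"Nodes excluded for 'Kill task failed': {kill_task}")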


      • Planning next purchase in light of recent guidance - looking over quotes.


      • Ongoing work to set up an RSE for the Google PanDA queues.


      • Investigating GGUS 167935 (inbound transfers), related to token testing.


      OU:

      • Running well, no major issues.
      • The question of why last month's CPU efficiency at OU is reported as 0 in CRIC still needs to be addressed, though (the efficiency formula is sketched after this list).
      • Will use most of the FY24 funding to replace the 700 TB of storage that goes out of warranty later this fall.
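
      For reference, a minimal sketch of how a CPU-efficiency figure of this kind is typically computed, using made-up accounting numbers (the real inputs come from the site's job accounting, not from these placeholders):

        # Hypothetical totals, purely to illustrate the ratio.
        cpu_seconds = 6.8e6      # total CPU time consumed by the jobs
        wall_seconds = 1.0e6     # total wall-clock time of the jobs
        cores = 8                # cores allocated per job
        efficiency = cpu_seconds / (wall_seconds * cores)
        print(f"CPU efficiency: {efficiency:.1%}")   # prints 85.0%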