US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 → 11:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
      • Stable running generally
        • MWT2 IU had troubles with a network upgrade.
        • There was a rogue jobset yesterday with a huge memory leak.
      • Keep working on moving to EL9.
    • 11:10 → 11:20
      AGLT2 10m
      Speakers: Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      1. Have a GGUS ticket about degraded squid service: one of the varnish servers, sl-um-es2, was not working; it is being fixed.

      2. Made progress with OpenSearch; in the process of migrating data from Elasticsearch to OpenSearch on RHEL9 (see the first sketch after this list).

      3. Still see CVMFS errors on worker nodes at a rate of about 1-2 nodes/day; normally cvmfs_config killall followed by cvmfs_config probe fixes the problem (see the second sketch after this list).

      4. Finalized the UM Tier3/DOE-funded hardware purchase (1 Dell R760xd2 storage server and 2 R6625 worker nodes).
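
      A minimal sketch of the Elasticsearch-to-OpenSearch migration in item 2, assuming the standard remote-reindex API is used; the hosts, index name, and credentials below are placeholders, not the actual AGLT2 setup.

      import requests

      # Placeholder endpoints; the real AGLT2 hosts are assumptions.
      OS_HOST = "https://opensearch.example.org:9200"
      ES_REMOTE = "http://elasticsearch.example.org:9200"

      # Ask the OpenSearch cluster to pull one index from the old
      # Elasticsearch cluster via the server-side remote reindex API
      # (the remote host must be whitelisted in opensearch.yml).
      body = {
          "source": {"remote": {"host": ES_REMOTE}, "index": "monitoring-2024"},
          "dest": {"index": "monitoring-2024"},
      }

      # wait_for_completion=false returns a task id to poll at /_tasks/<id>
      # instead of holding the HTTP request open for the whole copy.
      resp = requests.post(
          f"{OS_HOST}/_reindex",
          params={"wait_for_completion": "false"},
          json=body,
          auth=("admin", "changeme"),  # placeholder credentials
          timeout=60,
      )
      resp.raise_for_status()
      print(resp.json().get("task"))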

       
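      And a minimal sketch of the CVMFS remediation in item 3, to be run as root on an affected worker node; cvmfs_config killall and cvmfs_config probe are the standard CVMFS admin commands, while the wrapper around them is illustrative.

      import subprocess

      def remediate_cvmfs() -> bool:
          # Kill any wedged CVMFS helper processes on this node...
          subprocess.run(["cvmfs_config", "killall"], check=True)
          # ...then re-probe the configured repositories; probe reports
          # a failure if any repository cannot be mounted and accessed.
          probe = subprocess.run(["cvmfs_config", "probe"])
          return probe.returncode == 0

      if __name__ == "__main__":
          raise SystemExit(0 if remediate_cvmfs() else 1)
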

    • 11:20 → 11:30
      MWT2 10m
      Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US))

      The IU switch migration led to network issues between IU and UC and between IU and CERN over the last week; this seems to be fixed now. We are taking the time to rebuild all IU worker nodes as EL9.

      A large batch of jobs was OOMing workers at UC yesterday (Mar 5). We identified the task and submitted a report to DPD. Rod Walker halted the jobs and we are recovering the workers at UC now, taking the opportunity to rebuild machines that went down as EL9.

      A cooling event in a data center at UC brought down a few services used by the AF and by the node rebuilds. They recovered quickly, and we are now back to full functionality in that regard.

      Rebuilding workers and storage at UC as EL9 to meet the June 30th EoL.

      The UIUC data center migration has been pushed back to a currently unknown date. We are inquiring about the status of EL9 for those nodes.

      DC24 pushed our older storage almost to capacity; we requested some deletions of DC24 data to make sure production is not affected. We also made adjustments to some of our older storage to handle the amount of data flow. We hope to retire said storage with our next purchase.

    • 11:30 → 11:40
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Re-racking is almost complete: everything has been moved to the new racks, but some adjustments (cabling, etc.) are necessary to finalize the setup.
      I have managed to get the C6420 working with OKD. It is currently being re-racked.
      The installation of new storage is in progress. I anticipate it will be made available to me by NESE soon.
      One dCache pool had a hardware failure. NESE and Dell have restored it to a condition that allows data extraction. The data is currently being transferred to our allocated space in NESE's Ceph; this process will take some time. Once the transfer is complete, the data will be made available to dCache. Hiro is helping with the steps to take in the event of data loss in the pool directory. ZFS is operational but in a degraded state, with 42 files reported as lost (a sketch of counting such losses follows below). A JIRA ticket will be opened if losses exceed 50,000 files.
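
      A minimal sketch of counting the losses, assuming they appear as the permanent errors reported by zpool status -v; the pool name is a placeholder and the parsing is illustrative, not the procedure Hiro actually uses.

      import subprocess

      THRESHOLD = 50_000  # JIRA-ticket threshold quoted above

      def lost_files(pool: str) -> list[str]:
          # 'zpool status -v' ends with a "Permanent errors have been
          # detected in the following files:" section, one path per line.
          out = subprocess.run(
              ["zpool", "status", "-v", pool],
              capture_output=True, text=True, check=True,
          ).stdout
          paths, seen_header = [], False
          for line in out.splitlines():
              if "Permanent errors" in line:
                  seen_header = True
              elif seen_header and line.strip().startswith("/"):
                  paths.append(line.strip())
          return paths

      if __name__ == "__main__":
          lost = lost_files("pool0")  # placeholder pool name
          print(f"{len(lost)} files lost; open a JIRA: {len(lost) > THRESHOLD}")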

    • 11:40 → 11:50
      SWT2 10m
      Speakers: Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • The site was not full for a couple of days; it has been draining and refilling repeatedly. We suspect a possible change to the CVMFS wrapper, which is causing our compute nodes to fail jobs due to missing CVMFS mounts. We are still investigating this.

      • We also suspect DC24 may have caused issues for our site prior to the CVMFS issue. 
      • Created a script to modify the NHC config file to check for missing mounts and pull the node out of Slurm, to alleviate the CVMFS issue for now (see the sketch after this list).

      • UPS upgrade is scheduled for Thursday 3/7 (tomorrow).  

      • Debugging hardware problems on a few compute nodes, which can be tedious.

      • Working on resolving perfSONAR issues.

      • Working on testing and implementing token support for the storage nodes.

      • Continuing to test Google HIMEM queues.
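
      A minimal sketch of the check behind the NHC workaround above, assuming a node should drain itself in Slurm when a required CVMFS repository is not mounted; the repository list is illustrative, and the real script edits the NHC config rather than calling scontrol directly.

      import os
      import socket
      import subprocess

      # Illustrative repository list; the set a site requires varies.
      REPOS = ["/cvmfs/atlas.cern.ch", "/cvmfs/atlas-condb.cern.ch"]

      def missing_mounts() -> list[str]:
          # Each healthy repository shows up as its own mount point.
          return [repo for repo in REPOS if not os.path.ismount(repo)]

      def drain(node: str, reason: str) -> None:
          # Mark the node DRAIN so Slurm stops scheduling jobs on it.
          subprocess.run(
              ["scontrol", "update", f"NodeName={node}",
               "State=DRAIN", f"Reason={reason}"],
              check=True,
          )

      if __name__ == "__main__":
          missing = missing_mounts()
          if missing:
              drain(socket.gethostname(), "CVMFS missing: " + ",".join(missing))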

       

      OU:

      • Generally stable operations.
      • Occasional deletion timeouts which are not understood. Some are caused by temporary storage overload, but others happen without load, so it appears that Rucio deletion attempts for non-existent files take a long time to return 'file not found' the first time, and then successive identical requests are faster (see the timing sketch below). Need to work with Andy and Wei to track this down.
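
      A minimal sketch for reproducing the timing asymmetry, assuming the gfal2 Python bindings (the client library behind Rucio deletions) are available; the URL is a placeholder, not a real OU path.

      import time
      import gfal2  # python3-gfal2 bindings

      # Placeholder SURL; point this at a file that does not exist on the
      # OU storage to compare first and second deletion attempts.
      URL = "davs://storage.example.edu:1094/atlasdatadisk/rucio/tests/no-such-file"

      ctx = gfal2.creat_context()
      for attempt in (1, 2):
          start = time.time()
          try:
              ctx.unlink(URL)
          except gfal2.GError as err:
              # Expect 'file not found'; the interesting part is how much
              # longer the first attempt takes than the second.
              print(f"attempt {attempt}: {err.message} ({time.time() - start:.1f}s)")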