US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Top of the meeting discussion 10m
      Speakers: Fred Luehring (Indiana University (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
      • Please provide your quarterly reporting by the end of the day on Monday, April 22.
        • This will give the various levels of management time to write their own reports with your information in mind.
      • Please check all of the milestones affecting your site on this quarter's list: https://docs.google.com/spreadsheets/d/1DsMH-16v7bJy6qEkvTEWdLCkfeAXC6VpL019rUpcET8
        • Let me know ASAP about any changes to your milestones. Any entries listed in bold on the spreadsheet are changes that are already under discussion.
        • Both CentOS 7 and OSG 3.6 go EOL on Jun 30, so please plan accordingly.
      • Also check your site info as of Mar 31 on the v67 Mar 24 tab of the capacity sheet: https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4
      • And update your site info as of Mar 31 on the services sheet: https://docs.google.com/spreadsheets/d/1_fKB6GckfODTzEvOgRJu9sazxICM_RN95y039DZHF7U
        • NB: Answering yes to the IPv6 question now includes having all of the compute dual stacked.
      • We will start with some discussion about the Taiwan Tier 2, which has joined the US Cloud.
        • Could we change the meeting starting time to 9 am CDT / 10 am EDT? This will be easier on the Taiwan team as it will be 10 pm China Standard Time.
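
As a quick check of the proposed start time, here is a minimal Python sketch of the conversion. It assumes fixed offsets (CDT = UTC-5, EDT = UTC-4, China Standard Time = UTC+8) with US daylight saving time in effect. The function name `proposed_start_times` is hypothetical, and the calendar date in the code is an arbitrary placeholder.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets (assumption: US daylight saving time is in effect).
CDT = timezone(timedelta(hours=-5), "CDT")
EDT = timezone(timedelta(hours=-4), "EDT")
CST_CHINA = timezone(timedelta(hours=8), "CST")  # China Standard Time


def proposed_start_times(hour_cdt=9):
    """Return the meeting start as 'HH:MM' strings keyed by zone label."""
    # The calendar date is an arbitrary placeholder; only the clock time matters.
    start = datetime(2024, 5, 1, hour_cdt, 0, tzinfo=CDT)
    return {
        "CDT": start.strftime("%H:%M"),
        "EDT": start.astimezone(EDT).strftime("%H:%M"),
        "China": start.astimezone(CST_CHINA).strftime("%H:%M"),
    }


print(proposed_start_times())
```

With the default of 9 am CDT, this should agree with the 10 am EDT and 10 pm China Standard Time figures quoted above.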
    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric.Han-Wei Yen (Academia Sinica (TW)), Felix.hung-te Lee (Academia Sinica (TW)), Ofer Rind (Brookhaven National Laboratory)
    • 11:20 11:30
      AGLT2 10m
      Speakers: Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      MSU is having problems with the two Mellanox NICs used for network capture, which is preventing progress on the SOC at MSU.
      The ticket with NVIDIA was escalated over the weekend.
      We are currently scheduling an interactive session for this afternoon.
      Conclusion: the MSU node is not ready, but we will hopefully know more about the timeline today or tomorrow.

      There have been more CVMFS hiccups since yesterday (16-Apr).

      A downtime is scheduled for May 1st for the next attempt at replacing the UM room UPS breaker.

    • 11:30 11:40
      MWT2 10m
      Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US))
      • 75% done with Alma9 rebuilds.
      • Had a GGUS ticket for a degraded squid. This happened due to a network issue on the server that hosts the service. The load was picked up by another endpoint.
      • We were partially drained on April 12th due to our MD3460s being overloaded, but we recovered.
      • We plan to get quotes for our yearly purchase soon.
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
      • Stable operations throughout the week.
      • Network for new NESE servers (~3.1 PB) completed. We are using the opportunity provided by the new servers to study the performance of filesystems other than ZFS. These studies will be important for the long-term operation of the site.
      • DWDM for NEREN connection installed. Waveservers being ordered. We are working with Ciena engineers to settle final details.
      • Upgrade of the internal NET2 - NESE connection to 400 Gb/s delayed until next week (to allow for operations during the week).
      • Ticket about squid server: the squid server is available at NET2 but is not seen by the monitoring service. A failover server, accessible to the monitoring service, is planned to be installed next week.
      • Investigating issues with network provider inside OKD clusters. Since last OKD update, some servers get their CPU overcommitted.
    • 11:50 12:00
      SWT2 10m
      Speakers: Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      OU:

      • Site running well, no major issues
      • Slate-Squid in production locally, just need to figure out how to update CRIC such that it will be used as primary squid rather than the UTA ones.
      • OSCER EL9 upgrade plans are on track for end of June

      CPB: 

      • Testing LSM replacement - ran into an issue during last week's switch - will try again early next week with a modified configuration
      • Currently setting up and testing tokens for the storage DTNs - completed setup of a test DTN to try this out before implementation in the Tier 2
      • Evaluating new build system / RHEL9 in the test cluster
      • Debugging a couple of recent HC offlining events (only affects analysis-functional-test-jobs) - may be problematic WNs?
      • CSE student working with us on a program to provide atime for LOCALGROUPDISK (requested by analysis support team)
      • Troubleshooting WN hardware issues